1 What is Distributed Computing?

In  the  term  distributed  computing,  the  word  distributed  means  spread  out 
across  space.  Thus,  distributed  computing  is  an  activity  performed  on  a  spa¬ 
tially  distributed  system.  Although  one  usually  speaks  of  a  distributed  sys¬ 
tem,  it  is  more  accurate  to  speak  of  a  distributed  view  of  a  system.  A  hard¬ 
ware  designer  views  an  ordinary  sequential  computer  as  a  distributed  system, 
since its components are spread across several circuit boards, while a Pascal
programmer views the same computer as nondistributed. An important
problem  in  distributed  computing  is  to  provide  a  user  with  a  nondistributed 
view  of  a  distributed  system — for  example,  to  implement  a  distributed  file 
system  that  allows  the  client  programmer  to  ignore  the  physical  location  of 
his  data. 

We  use  the  term  model  to  denote  a  view  or  abstract  representation  of  a 
distributed  system.  We  will  describe  and  discuss  models  informally,  although 
we  do  present  formal  methods  that  can  be  used  to  reason  about  them. 

The  models  of  computation  generally  considered  to  be  distributed  are 
process  models,  in  which  computational  activity  is  represented  as  the  con¬ 
current  execution  of  sequential  processes.  Other  models,  such  as  Petri 
nets  [Thi85],  are  usually  not  studied  under  the  title  of  distributed  comput¬ 
ing,  even  though  they  may  be  used  to  model  spatially  distributed  systems. 
We  therefore  restrict  our  attention  to  process  models. 

Different  process  models  are  distinguished  by  the  mechanism  employed 
for  interprocess  communication.  The  process  models  that  are  most  obviously 
distributed  are  ones  in  which  processes  communicate  by  message  passing — 
a  process  sends  a  message  by  adding  it  to  a  message  queue,  and  another 
process  receives  the  message  by  removing  it  from  the  queue.  These  models 
vary  in  such  details  as  the  length  of  the  message  queues  and  how  long  a  delay 
may  occur  between  when  a  message  is  sent  and  when  it  can  be  received. 
There  are  two  significant  assumptions  embodied  in  message-passing  models: 

•  Message  passing  represents  the  dominant  cost  of  executing  an  algo¬ 
rithm. 

•  A  process  can  continue  to  operate  correctly  despite  the  failure  of  other 
processes. 

The  first  assumption  distinguishes  the  use  of  message  passing  in  distributed 
computing  from  its  use  as  a  synchronization  mechanism  in  nondistributed 
concurrent computing. The second assumption characterizes the important
subfield of fault-tolerant computing. Some degree of fault tolerance is
required of most real distributed systems, but one often studies distributed
algorithms  that  are  not  fault  tolerant,  leaving  other  mechanisms  (such  as 
interrupting  the  algorithm)  to  cope  with  failures. 

Other  process  models  are  considered  to  be  distributed  if  their  interpro¬ 
cess  communication  mechanisms  can  be  implemented  efficiently  enough  by 
message  passing,  where  efficiency  is  measured  by  the  message  passing  costs 
incurred in achieving a reasonable degree of fault tolerance. Algorithms exist
for implementing virtually any process model by a message-passing model
with  any  desired  degree  of  fault  tolerance.  Whether  an  implementation  is 
efficient  enough,  and  what  constitutes  a  reasonable  degree  of  fault  toler¬ 
ance  are  matters  of  judgement,  so  there  is  no  consensus  on  what  models  are 
distributed. 

2  Models  of  Distributed  Systems 

2.1 Message-Passing Models

2.1.1  Taxonomy 

A wide variety of message-passing models can be used to represent distributed
systems. They can be classified by the assumptions they make
about  four  separate  concerns:  network  topology,  synchrony,  failure,  and 
message  buffering.  Different  models  do  not  necessarily  represent  different 
systems; they may be different views of the same system. An algorithm for
implementing  (or  simulating)  one  model  with  another  provides  a  mechanism 
for  implementing  one  view  of  a  system  with  a  lower-level  view.  The  entire 
goal  of  system  design  is  to  implement  a  simple  and  powerful  user-level  view 
with  the  lower-level  view  provided  by  the  hardware. 

Network  Topology  The  network  topology  describes  which  processes  can 
send  messages  directly  to  which  other  processes.  The  topology  is  described 
by  a  communication  graph  whose  nodes  are  the  processes,  and  where  an 
arc  from  process  i  to  process  j  denotes  that  i  can  send  messages  directly 
to  j.  Most  models  assume  an  undirected  graph,  where  an  arc  joining  two 
processes  means  that  each  can  send  messages  to  the  other.  However,  one 
can  also  consider  directed  graph  models  in  which  there  can  be  an  arc  from 
i to j without one from j to i, so i can send messages to j but not vice
versa. We use the term link to denote an arc in the communication graph;
a message sent directly from one process to another is said to be sent over
the  link  joining  the  two  processes. 

In  some  models,  each  process  is  assumed  to  know  the  complete  set  of  pro¬ 
cesses,  and  in  others  a  process  is  assumed  to  have  only  partial  knowledge — 
usually  the  identity  of  its  immediate  neighbors.  The  simplest  models,  em¬ 
bodying  the  strongest  assumptions,  are  ones  with  a  completely  connected 
communication  graph,  where  each  nonfaulty  process  knows  about  and  can 
send  messages  directly  to  every  other  nonfaulty  process.  Routing  algorithms 
are  used  to  implement  such  a  model  with  a  weaker  one. 
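
As a minimal sketch of how routing can present a completely connected view on top of a sparser communication graph, the following Python computes next-hop choices by breadth-first search. The adjacency-list representation and the names `next_hops` and `route` are illustrative assumptions rather than anything prescribed by the text, and the sketch assumes a connected graph with no failures.

```python
from collections import deque


def next_hops(graph, source):
    """Return {destination: first neighbor on a shortest path from source}."""
    first = {}
    seen = {source}
    pending = deque()
    for nbr in graph[source]:
        if nbr not in seen:
            seen.add(nbr)
            first[nbr] = nbr
            pending.append(nbr)
    while pending:
        node = pending.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                first[nbr] = first[node]  # inherit the first hop used to reach node
                pending.append(nbr)
    return first


def route(graph, source, dest):
    """Return the processes a message visits when forwarded hop by hop."""
    path = [source]
    while path[-1] != dest:
        path.append(next_hops(graph, path[-1])[dest])
    return path


# Hypothetical topology: four processes in a line, links only between neighbors.
line = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

With this table, process `a` can address `d` as if a direct link existed; the message actually travels over three links.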

Synchrony  In  the  following  discussion,  all  synchrony  conditions  are  as¬ 
sumed  to  apply  only  in  the  absence  of  failure.  Failure  assumptions  are 
treated  separately  below. 

A completely asynchronous model is one with no concept of real time. It
is  assumed  that  messages  are  eventually  delivered  and  processes  eventually 
respond,  but  no  assumption  is  made  about  how  long  it  may  take. 

Other  models  introduce  the  concept  of  time  and  assume  known  upper 
bounds on message transmission time and process response time. For
simplicity, in our examples we will use the simplest form of this assumption,
that a message generated in response to an event at any time t (such as the
receipt of another message) arrives at its destination by time t + δ, where δ
is a known constant.

Processes  need  some  form  of  real-time  clock  to  take  advantage  of  this 
assumption.  The  simplest  type  of  clock  is  a  timer,  which  measures  elapsed 
time;  the  instantaneous  values  of  different  processes’  timers  are  independent 
of  one  another.  Timers  are  used  to  detect  failure,  the  assumption  made 
above  implying  that  a  failure  must  have  occurred  if  the  reply  to  a  message 
is not received within 2δ seconds of the sending of that message.
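
The timeout rule just described can be sketched as a small decision function. This is only an illustration under the stated delay bound: the name `suspect_failure`, its parameters, and the value chosen for `DELTA` are assumptions, and all readings are taken from the querying process's own timer.

```python
DELTA = 0.5  # assumed bound, in seconds, on delivery plus response time


def suspect_failure(sent_at, reply_at, now, delta=DELTA):
    """Decide from a local timer alone whether a failure must have occurred.

    sent_at  -- timer reading when the query was sent
    reply_at -- timer reading when the reply arrived, or None if none has
    now      -- current timer reading
    """
    deadline = sent_at + 2 * delta  # query needs at most delta, reply at most delta
    if reply_at is not None and reply_at <= deadline:
        return False                # reply arrived in time
    return now > deadline           # still waiting, or the reply was late
```

In a running system the readings might come from `time.monotonic()`, which measures elapsed time and so plays the role of a timer rather than a synchronized clock.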

Some  models  make  the  stronger  assumption  that  processes  have  syn¬ 
chronized  clocks  that  run  at  approximately  the  correct  rate  of  one  second 
of  clock  time  per  second  of  real  time.  The  simplest  such  assumption,  which 
we use in our discussion, is that at each instant, the clocks of any two
processes differ by at most ε for some known constant ε. Algorithms can use
synchronized clocks to reduce the number of messages that need to be sent.
For example, if a process is supposed to send a message at a known time
t, then the receiving process knows that there must have been a failure if
the message did not arrive by approximately time t + δ + ε on its clock
(the δ due to delivery time and the ε due to the difference between the two
processes' clocks). Thus, one can test for failure by sending a single message
rather than the query and response required with only timers. It appears
to be a fundamental property of distributed systems that algorithms which
depend upon synchronized clocks incur a delay proportional to the bound
on clock differences (taken to be ε in our discussion).

Given  a  bound  on  the  ratio  of  the  running  rates  of  any  two  processes’ 
timers,  and  the  assumed  bound  on  message  and  processing  delays,  algo¬ 
rithms  exist  for  constructing  synchronized  clocks  from  timers.  These  algo¬ 
rithms  are  discussed  later. 

The  most  strongly  synchronous  model  is  one  in  which  the  entire  com¬ 
putation  proceeds  in  a  sequence  of  distinct  rounds.  At  each  round,  every 
process  sends  messages,  possibly  to  every  other  process,  based  upon  the 
messages  that  it  received  in  previous  rounds.  Thus,  the  processes  act  like 
processors  in  a  single  synchronous  computer.  This  model  is  easily  simulated 
using synchronized clocks by letting each round begin δ + ε seconds after
the  preceding  one. 
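
To make the round structure concrete, here is a small simulation sketch in which every process, in every round, sends the largest value it has seen to the processes it can reach. The flooding task, the function name `run_rounds`, and the ring used as an example are assumptions chosen for illustration; only the lock-step send-then-receive structure comes from the model described above.

```python
def run_rounds(neighbors, initial, rounds):
    """Simulate lock-step rounds of flooding the largest known value.

    neighbors -- {process: list of processes it can send to}
    initial   -- {process: starting value}
    """
    known = dict(initial)
    for _ in range(rounds):
        # Send phase: each message depends only on what was received
        # in earlier rounds, never on messages of the current round.
        inbox = {p: [] for p in neighbors}
        for p, nbrs in neighbors.items():
            for q in nbrs:
                inbox[q].append(known[p])
        # Receive phase: all messages of the round are delivered together.
        for p, msgs in inbox.items():
            known[p] = max([known[p]] + msgs)
    return known


# Hypothetical directed ring of four processes.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
values = {0: 7, 1: 3, 2: 9, 3: 1}
```

On this ring the maximum reaches every process after three rounds, so the number of rounds serves directly as the time measure.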

Failure  In  message-passing  models,  one  can  consider  both  process  failures 
and  communication  failures.  It  is  commonly  assumed  that  communication 
failure  can  result  only  in  lost  messages,  although  duplication  of  messages  is 
sometimes  allowed.  Models  in  which  incorrect  messages  may  be  delivered  are 
seldom  studied  because  it  is  believed  that  in  practice,  the  use  of  redundant 
information  (checksums)  allows  the  system  to  detect  garbled  messages  and 
discard  them. 

Models  may  allow  transient  errors  that  destroy  individual  messages,  or 
they  may  consider  only  failures  of  individual  links.  A  link  failure  may  cause 
all  messages  sent  over  the  link  to  be  lost  or,  in  a  model  with  timers  or  clocks, 
a  failed  link  may  deliver  messages  too  late.  Since  algorithms  that  use  timers 
or  clocks  usually  discard  late  messages,  there  is  little  use  in  distinguishing 
between  late  and  lost  messages.  Of  particular  concern  in  considering  link 
failures  is  whether  or  not  one  considers  the  possibility  of  network  partition, 
where  the  communication  graph  becomes  disconnected,  making  it  impossible 
for  some  pairs  of  nodes  to  communicate  with  each  other. 

The  weakest  assumption  made  about  process  failure  is  that  failure  of  one 
process  cannot  affect  communication  over  a  link  joining  two  other  processes, 
but any other behavior by the failed process is possible. Such models are
said  to  allow  Byzantine  failure. 

More restrictive models permit only omission failures, in which a faulty
process fails to send some messages. (Since late messages are usually
discarded, failures that cause a process to send messages too late can be
considered omission failures.)

The  most  restrictive  models  allow  only  halting  failures,  in  which  a  failed 
process  does  nothing.  In  the  subclass  of  fail-stop  models,  other  processes 
know  when  a  process  has  failed  [SA86]. 

In  addition  to  the  actual  failure  mode,  some  models  make  assumptions 
about  how  a  failed  process  may  be  restarted.  Models  that  allow  only  halting 
failures  often  assume  some  form  of  stable  storage  that  is  not  affected  by  a 
failure. A failed process is restarted with its stable storage in the same state
as before the failure and with every other part of its state restored to some
initial  values. 

Failure models are problematic because it is difficult to determine how
accurately  they  describe  the  behavior  of  real  systems.  It  seems  to  be  a 
widely  held  view  among  implementers  of  distributed  systems  that  message 
loss  and  link  failure  adequately  represent  intercomputer  communication  fail¬ 
ures.  Whether  or  not  a  particular  model  of  process  failure  is  suitable  depends 
upon  the  degree  of  reliability  one  requires  of  the  system.  There  is  general 
agreement  that  halting  failure  represents  the  most  common  type  of  com¬ 
puter  failure — the  familiar  “system  crash”.  It  seems  to  provide  a  suitable 
model  when  only  modest  reliability  is  required.  Omission  faults,  caused  by 
unusual  demand  slowing  down  a  computer’s  response  time,  should  probably 
be considered when greater reliability is required. When extremely high
reliability is required, especially when failure of the entire system could be
life threatening, it seems necessary to assume Byzantine failures.

As  we  describe  later,  algorithms  that  tolerate  Byzantine  failures  are  more 
costly than ones that tolerate only more restricted failures. Less costly
algorithms  can  be  achieved  by  strengthening  Byzantine  failure  models  to 
allow  digital  signatures  [DH79].  It  is  assumed  that  given  an  arbitrary  data 
item D, any nonfaulty process i can generate a digital signature S(i, D) such
that any other process can determine whether a particular value v equals
S(i, D) for a given D, but no other process can generate S(i, D). Although
digital signatures are a cryptographic concept, in practical fault-tolerant
algorithms  they  are  implemented  with  redundancy.  It  is  believed  that,  by 
the  careful  use  of  redundancy,  the  assumption  made  about  digital  signatures 
can  be  achieved  with  high  enough  probability  to  allow  the  use  of  the  model 
even when extremely high reliability is required.

Message Buffering In message-passing models, there is a delay between
when  a  message  is  sent  and  when  it  is  received.  Such  a  delay  implies  that 
there  is  some  form  of  message  buffering.  Models  may  assume  either  finite 
or  infinite  buffers.  With  finite  buffers,  any  link  may  contain  only  a  fixed 
maximum  number  of  messages  that  have  been  sent  over  that  link  but  not 
yet received. When the link’s buffer is full, attempts to send an additional
message  over  the  link  either  fail  and  produce  some  error  response  to  the 
sending  process  or  else  cause  the  sending  process  to  wait  until  there  is  room 
in the buffer. With infinite buffers, there may be arbitrarily many unreceived
messages  in  a  link’s  buffer,  and  the  sender  can  always  send  another  message 
over  the  link.  Although  any  real  system  has  a  finite  capacity,  this  capacity 
may  be  large  enough  to  make  infinite  buffering  a  reasonable  abstraction. 
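
The two behaviors of a full finite buffer (report an error, or make the sender wait) can be sketched with Python's standard `queue.Queue`. The class name `Link` and its method names are assumptions made for this illustration.

```python
import queue


class Link:
    """A one-way link whose buffer holds at most `capacity` unreceived messages.

    A capacity of 0 gives an unbounded queue, which corresponds to the
    infinite-buffer abstraction.
    """

    def __init__(self, capacity):
        self._buffer = queue.Queue(maxsize=capacity)

    def try_send(self, message):
        """Non-blocking send: return False instead of waiting when full."""
        try:
            self._buffer.put_nowait(message)
            return True
        except queue.Full:
            return False

    def send(self, message):
        """Blocking send: wait until there is room in the buffer."""
        self._buffer.put(message)

    def receive(self):
        """Remove and return the oldest buffered message (waits if empty)."""
        return self._buffer.get()
```

Which of `try_send` or `send` a model provides is exactly the choice between an error response and a waiting sender.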

If a link’s buffer can hold more than one message, it is possible for
messages to be received in a different order than they were sent. Models with
FIFO (first-in-first-out) buffering assume that messages that are not lost are
always received in the same order in which they were sent. Many algorithms
for  asynchronous  systems  work  only  under  the  assumption  of  FIFO  buffering. 
In  most  algorithms  for  systems  with  timers  or  synchronized  clocks,  a  process 
does  not  send  a  message  to  another  process  until  it  knows  that  the  previous 
message to that process has either been delivered or lost, so FIFO buffering
need  not  be  assumed.  At  the  lowest  level,  real  distributed  systems  usually 
provide  FIFO  buffering.  This  need  not  be  the  case  at  higher  levels,  where 
messages  may  be  routed  to  their  destination  along  multiple  possible  paths. 
However,  if  it  is  not  provided  by  the  underlying  communication  mechanism, 
FIFO  buffering  can  be  implemented  by  numbering  the  messages. 
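
One way to picture FIFO buffering built from numbered messages is a receiver that holds back anything arriving ahead of its turn. The sketch below assumes the sender numbers messages 0, 1, 2, ... on each link and that no message is lost (a lost message would stall delivery here); the class name `FifoReceiver` is hypothetical.

```python
class FifoReceiver:
    """Deliver messages in sending order even if the link reorders them."""

    def __init__(self):
        self._next = 0   # sequence number expected next
        self._held = {}  # messages that arrived early, keyed by number

    def on_arrival(self, seq, payload):
        """Accept one numbered message; return the payloads now deliverable."""
        if seq < self._next:
            return []    # duplicate of a message already delivered
        self._held[seq] = payload
        ready = []
        while self._next in self._held:
            ready.append(self._held.pop(self._next))
            self._next += 1
        return ready
```

Handling loss as well would need something extra, such as timeouts or retransmission, which this sketch deliberately leaves out.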

2.1.2  Measuring  Complexity 

There  are  two  basic  complexity  measures  for  distributed  algorithms:  time 
and  message  complexity.  The  time  complexity  of  an  algorithm  measures 
the  time  needed  both  for  message  transmission  and  for  computation  within 
the  processes.  However,  computations  performed  by  individual  processes 
are traditionally ignored, only message-passing time being counted. This is
a reasonable approximation for current computer networks in which message
delivery time is usually several milliseconds or more, while computer
operations  are  measured  in  microseconds.  However,  a  millisecond  is  only 
a  thousand  microseconds,  and  a  practical  algorithm  should  not  perform 
millions  of  extra  calculations  to  save  a  few  messages.  Moreover,  the  large 
difference between message delivery time and processing time should not
be taken for granted. Although it takes much longer for electromagnetic
signals to travel between processors in a spatially distributed system than
within a processor, current processing speed is limited primarily by circuit
delays  rather  than  transmission  speed.  With  current  technology,  the  high 
cost  of  sending  a  message  is  an  artifact  of  the  way  systems  are  designed, 
since  electrical  signals  can  travel  a  kilometer  in  a  few  microseconds. 

The  usual  measure  of  message-passing  time  for  an  algorithm  is  the  length 
of  the  longest  chain  of  messages  that  occurs  before  the  algorithm  terminates, 
where  each  message  in  the  chain  except  the  first  is  generated  by  the  receipt 
of  the  previous  one.  For  completely  asynchronous  models,  where  no  as¬ 
sumptions  are  made  about  message  delivery  times,  this  seems  to  be  the  only 
reasonable  way  to  measure  worst-case  message-passing  time;  for  synchronous 
models that operate in rounds, it is just the number of rounds. The measure
can be refined to take account of more precise timing assumptions, for
example,  if  transmission  delays  are  different  for  different  links.  Of  course, 
processing  time  should  be  included  in  the  time  complexity  if  it  is  significant. 
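
The chain-length measure can be computed from a record of which message triggered which. The following sketch assumes such a record is available as a dictionary and that it contains no cycles; the function name `chain_length` and the example trace are illustrative assumptions.

```python
def chain_length(triggered_by):
    """Length of the longest chain in which each message after the first
    is generated by the receipt of the previous one.

    triggered_by -- {message: the message whose receipt generated it,
                     or None if it was sent spontaneously}
    """
    memo = {}

    def depth(m):
        if m not in memo:
            parent = triggered_by[m]
            memo[m] = 1 if parent is None else 1 + depth(parent)
        return memo[m]

    return max((depth(m) for m in triggered_by), default=0)


# Hypothetical run: one process queries three others and each replies.
trace = {
    "q1": None, "q2": None, "q3": None,
    "r1": "q1", "r2": "q2", "r3": "q3",
}
```

For this trace the message-passing time is 2 even though six messages are sent, which shows how the time measure and the message count can differ.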

The  most  common  measure  of  message  complexity  is  the  total  number 
of  messages  transmitted.  If  messages  contain  on  the  order  of  a  few  hundred 
bits  or  more,  then  the  total  number  of  bits  sent  might  be  a  better  measure 
of  the  cost  than  the  number  of  messages.  In  many  algorithms,  a  process 
broadcasts  the  same  message  to  n  other  processes.  Depending  upon  the 
implementation  details  of  the  system,  such  a  broadcast  might  cost  as  much 
as  sending  n  separate  messages,  or  it  might  cost  no  more  than  sending  a 
single  message. 

Tradeoffs  between  time  and  message  complexity  are  often  possible.  The 
minimal-time  algorithm  is  usually  simple,  with  more  complex  algorithms 
saving  messages,  but  taking  longer  to  terminate.  It  is  often  possible  to 
“improve”  algorithms  by  reducing  their  message  complexity  at  the  expense 
of  their  time  complexity.  However,  many  distributed  systems  contain  few 
enough  processes  that  an  algorithm  with  a  message  complexity  proportional 
to  the  square  of  the  number  of  processes  is  quite  practical  and  is  often  better 
than  a  more  complicated  one  that  uses  fewer  messages  but  takes  longer. 

As  with  sequential  algorithms,  there  is  also  the  question  of  whether  to 
measure  worst-case  or  average  behavior — for  example,  whether  to  measure 
the  maximum  number  of  messages  that  can  be  sent  or  the  expected  number 
(in  the  sense  of  probability  theory).  When  high  reliability  is  required,  worst- 
case  behavior  is  usually  the  appropriate  measure.  In  other  cases,  the  average 
cost  may  be  more  important.  Average  costs  have  been  derived  mainly  for 
probabilistic algorithms, in which processes make random choices.

2.2 Other Models


Other  models  of  concurrent  systems  are  usually  described  in  terms  of  lan¬ 
guage  constructs  for  interprocess  communication.  This  can  lead  to  the  con¬ 
fusion  of  underlying  concepts  (what  one  says)  with  language  issues  (how  one 
says  it),  but  we  know  of  no  simple  alternative  for  classifying  the  standard 
models. 


2.2.1  Shared  Variables 

In  the  earliest  models  of  concurrency,  processes  communicate  through  global 
shared variables, which are program variables that can be read and written by
all processes. Initially, the shared variables were accessed by the ordinary
program operations of expression evaluation and assignment; later variations
included
synchronization  primitives  such  as  semaphores  [Dij68]  and  monitors  [Hoa74] 
to  control  access  to  shared  variables.  Global  shared  variable  models  provide 
a  natural  representation  of  multiprocessing  on  a  single  computer  with  one 
or  more  processors  connected  to  a  central  shared  memory. 

The  most  natural  and  efficient  way  to  implement  global  shared  variables 
with  message  passing  is  to  have  each  shared  variable  maintained  by  a  sin¬ 
gle  process.  That  process  can  access  the  variable  locally;  it  requires  two 
messages  for  another  process  to  read  or  write  the  variable.  A  read  requires 
a  query  and  a  response  with  the  value;  a  write  requires  sending  the  new 
value  and  receiving  an  acknowledgement  that  the  operation  was  done — the 
acknowledgement  is  required  because  the  correctness  of  shared-variable  al¬ 
gorithms  depends  upon  the  assumption  that  a  write  is  completed  before  the 
next  operation  is  begun. 
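
The two-message cost of each remote access can be seen in a small sketch. Here ordinary method calls stand in for message transmission, and the class names `VariableOwner` and `RemoteVariable` are assumptions introduced only for this illustration.

```python
class VariableOwner:
    """The single process that maintains the shared variable."""

    def __init__(self, value):
        self.value = value

    def handle(self, request):
        """Process one request message and return the reply message."""
        kind, payload = request
        if kind == "read":
            return ("value", self.value)
        if kind == "write":
            self.value = payload
            return ("ack", None)
        raise ValueError(f"unknown request {kind!r}")


class RemoteVariable:
    """Another process's view of the variable; counts messages exchanged."""

    def __init__(self, owner):
        self._owner = owner
        self.messages = 0

    def _exchange(self, request):
        self.messages += 2  # one request plus one reply
        return self._owner.handle(request)

    def read(self):
        return self._exchange(("read", None))[1]

    def write(self, value):
        # Returning only after the acknowledgement ensures the write is
        # complete before the caller begins its next operation.
        self._exchange(("write", value))
```

A write followed by a read therefore costs four messages, and both operations stall if the owner never replies.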

Such  an  implementation  of  global  shared  variables  is  not  at  all  fault 
tolerant,  since  failure  of  the  process  holding  the  variable  blocks  the  progress 
of  any  other  process  that  accesses  it.  A  fault-tolerant  implementation  must 
maintain  multiple  copies  of  the  variable  at  different  processes,  which  requires 
much more message passing. Hence, global shared variable models are not
generally  considered  to  be  distributed. 

A  more  restrictive  class  of  models  permits  interprocess  communication 
only through local shared variables, which are shared variables that are
“owned”  by  individual  processes.  A  local  shared  variable  can  be  read  by 
multiple  processes,  but  it  can  be  written  only  by  the  process  that  owns  it. 
Reading  a  variable  owned  by  a  failed  process  is  assumed  to  return  some 
default value.

2.2.2 Synchronous Communication

Synchronous  communication  was  introduced  by  Hoare  in  his  Communicating 
Sequential Processes (CSP) language [Hoa78]. In CSP, process i sends a
value v to process j by executing the output command j!v; process j receives
that value, assigning it to variable x, by executing the input command i?x.
Unlike the case of ordinary message passing, the input and output commands
are executed synchronously. Execution of a j!v operation is delayed until
process j is ready to execute an i?x operation, and vice versa. Thus, a
CSP  communication  operation  waits  until  a  corresponding  communication 
operation  can  be  executed  in  another  process. 

There is an obvious way to implement synchronous communication with
message passing. Process i begins execution of a j!v command by sending
a message to j with the value v; when process j is ready to execute the
corresponding i?x command, it sends an acknowledgement message to i and
proceeds  to  its  next  operation.  Process  i  can  continue  its  execution  when  it 
receives  the  acknowledgement. 

Many  concurrent  algorithms  require  that  a  process  be  prepared  to  com¬ 
municate  with  any  one  of  several  processes,  but  actually  communicate  with 
only  one  of  them  before  doing  some  further  processing.  With  synchronous 
communication  primitives,  this  means  that  a  process  must  be  prepared  to 
execute  any  one  of  a  set  of  input  and/or  output  commands.  If  each  process 
could be waiting for an arbitrary set of communication commands, then
deciding  which  communications  should  occur  could  require  a  complicated 
distributed  algorithm.  For  example,  consider  a  network  of  three  processes, 
each  of  which  is  ready  to  communicate  with  either  one  of  the  other  two.  Any 
pair  of  them  can  execute  their  corresponding  communication  actions,  but 
only  one  pair  may  do  so,  and  deciding  upon  that  pair  requires  a  distributed 
algorithm.  To  get  around  this  difficulty,  CSP  allows  a  process  to  wait  for  an 
arbitrary  set  of  input  commands,  but  it  may  not  be  waiting  for  any  other 
communication  if  it  is  ready  to  perform  an  output  command.  The  choice 
of  which  communication  to  perform  can  then  be  made  within  a  process,  so 
each  communication  action  requires  only  two  messages. 

Although  CSP  allows  an  efficient  implementation  with  message  passing, 
it does not permit fault-tolerant algorithms. A process i that is waiting to
execute a j!v command cannot continue unless process j executes a
corresponding i?x command. The failure of process j therefore halts the execution
of process i. (This could be avoided if i could wait to communicate with
any one of several processes, which CSP prohibits.) Despite this difficulty,
CSP is often considered a distributed model.

Closely  related  to  synchronous  communication  is  the  remote  procedure 
call or rendezvous. A remote procedure call is executed just like an ordinary
procedure call, except the procedure is executed in another process. It can
be implemented with two messages: one to send the arguments of the call
and another to return the result. Halting and omission failures
can  be  handled  by  having  the  procedure  call  return  an  error  result  or  raise  an 
exception  if  no  response  to  the  first  message  is  received.  Remote  procedure 
call  is  currently  the  most  widely  used  language  construct  for  implementing 
distributed  systems  without  explicit  message-passing  operations. 
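
A remote procedure call built from two messages, with a timeout that turns a missing response into an exception, might be sketched as follows. In-process queues and a thread stand in for the network and the remote process; the names `remote_call`, `serve`, and `RemoteCallError` are assumptions for this illustration, not an existing RPC library's interface.

```python
import queue
import threading


class RemoteCallError(Exception):
    """Raised when no response to the call arrives within the timeout."""


def remote_call(request_q, args, timeout):
    """Send the arguments, then wait for the result or give up."""
    reply_q = queue.Queue(maxsize=1)
    request_q.put((args, reply_q))           # first message: the arguments
    try:
        return reply_q.get(timeout=timeout)  # second message: the result
    except queue.Empty:
        raise RemoteCallError("no response to call") from None


def serve(request_q, procedure):
    """Remote side: run `procedure` for each request until None arrives."""
    while True:
        item = request_q.get()
        if item is None:
            return
        args, reply_q = item
        reply_q.put(procedure(*args))
```

A caller might start the server with `threading.Thread(target=serve, args=(requests, some_procedure), daemon=True).start()` and then invoke `remote_call(requests, (2, 3), timeout=1.0)`; a halted server shows up to the caller only as `RemoteCallError`.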

2.3 Fundamental Concepts

The  theory  of  sequential  computing  rests  upon  fundamental  concepts  of 
computability  that  are  independent  of  any  particular  computational  model. 
If  there  are  any  such  fundamental  formal  concepts  underlying  distributed 
computing,  they  have  yet  to  be  developed.  At  present,  the  field  seems  to 
consist  of  a  collection  of  largely  unrelated  results  about  individual  models. 
Nevertheless,  one  can  make  some  informal  observations  that  seem  to  be 
important. 

Underlying  almost  all  models  of  concurrent  systems  is  the  assumption 
that  an  execution  consists  of  a  set  of  discrete  events,  each  affecting  only  part 
of  the  system’s  state.  Events  are  grouped  into  processes,  each  process  being 
a  more  or  less  completely  sequenced  set  of  events  sharing  some  common 
locality  in  terms  of  what  part  of  the  state  they  affect.  For  a  collection  of 
autonomous  processes  to  act  as  a  coherent  system,  the  processes  must  be 
synchronized. 

From the original work on concurrent process synchronization emerged
two  distinct  classes  of  synchronization  problem:  contention  and  cooperation. 
The  archetypical  contention  problem  is  the  mutual  exclusion  problem,  in 
which  each  process  has  a  critical  section  and  processes  must  be  synchronized 
so  that  no  two  of  them  execute  their  critical  section  at  the  same  time  [Dij65]. 
As  originally  stated,  this  problem  includes  the  requirement  that  a  process  be 
allowed  to  halt  when  not  executing  its  critical  section  or  its  synchronization 
protocol.  With  this  requirement,  solutions  are  possible  in  shared-variable 
models  but  not  in  asynchronous  message-passing  models,  which  require  that 
a  process  receive  a  message  from  every  other  process  before  it  can  enter 
its  critical  section.  However,  the  mutual  exclusion  problem  without  this 
requirement has been studied in asynchronous message-passing systems.

The classic problem in cooperation is the bounded buffer problem, in
which an unbounded sequence of values is transmitted in order from a
sender process to a receiver process, using a fixed-length array of registers
as a buffer. The receiver must wait when the buffer is empty, and the
sender  must  wait  when  the  buffer  is  full.  This  problem  is  best  viewed  as  a 
symmetrical  one,  in  which  the  sender  generates  filled  buffer  elements  for  use 
by  the  receiver  and  the  receiver  generates  empty  buffer  elements  for  use  by 
the  sender. 
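
The symmetrical view of the bounded buffer can be sketched with two counting semaphores, one counting empty elements and one counting filled elements. This is a minimal illustration assuming exactly one sender thread and one receiver thread; the class name `BoundedBuffer` and the helper `transfer` are hypothetical.

```python
import threading


class BoundedBuffer:
    """Fixed-length buffer for a single sender and a single receiver."""

    def __init__(self, size):
        self._slots = [None] * size
        self._empty = threading.Semaphore(size)  # empty elements for the sender
        self._filled = threading.Semaphore(0)    # filled elements for the receiver
        self._in = 0    # next slot to fill (touched only by the sender)
        self._out = 0   # next slot to empty (touched only by the receiver)

    def put(self, value):
        self._empty.acquire()      # sender waits while the buffer is full
        self._slots[self._in] = value
        self._in = (self._in + 1) % len(self._slots)
        self._filled.release()     # generate a filled element

    def get(self):
        self._filled.acquire()     # receiver waits while the buffer is empty
        value = self._slots[self._out]
        self._out = (self._out + 1) % len(self._slots)
        self._empty.release()      # generate an empty element
        return value


def transfer(values, size):
    """Send the list `values` through a buffer of `size` slots, in order."""
    buf = BoundedBuffer(size)
    sender = threading.Thread(target=lambda: [buf.put(v) for v in values])
    sender.start()
    received = [buf.get() for _ in values]
    sender.join()
    return received
```

The two methods mirror each other: each side consumes the kind of element the other side produces, which is the symmetry described above.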

The  fundamental  difference  between  these  two  forms  of  synchronization 
is  that  in  contention  problems  a  process  must  be  able  to  make  unlimited 
progress  even  if  other  processes  fail  to  progress,  while  in  cooperation  prob¬ 
lems  the  progress  of  one  process  depends  upon  the  progress  of  another.  For 
example,  in  the  mutual  exclusion  problem,  a  process  may  enter  its  crit¬ 
ical  section  an  unlimited  number  of  times  while  other  processes  are  not 
requesting entrance, but in the bounded buffer problem, after the sender
has filled the buffer it cannot proceed until the receiver creates an empty
buffer element.

Problems  of  contention  and  cooperation  appear  in  all  models  of  concur¬ 
rency.  A  class  of  problem  that  has  arisen  in  the  study  of  message-passing 
models  is  that  of  global  consistency.  For  example,  in  a  distributed  banking 
system,  one  would  like  all  branches  of  the  bank  to  have  a  consistent  view 
of  the  balance  of  any  single  account.  In  general,  one  would  like  to  describe 
a  distributed  system  in  terms  of  its  current  global  state.  The  global  con¬ 
sistency  problem  is  to  ensure  that  all  processes  have  a  consistent  view  of 
the  state.  In  the  banking  example,  the  amount  of  money  currently  in  each 
account  is  part  of  the  state. 

To  define  a  global  state,  there  must  be  a  total  ordering  of  all  transac¬ 
tions — to  determine  if  there  is  enough  money  in  my  account  for  a  withdrawal 
request  to  be  granted,  one  must  know  if  a  deposit  action  occurred  before  or 
after  the  request.  In  an  asynchronous  message-passing  model,  there  is  no 
natural total ordering of events, only the partial ordering among events
defined by letting event a precede event b if there is information flow
permitting a to affect b. The definition of a global state requires completing
the partial ordering of events, defined by the causality relation, to a total ordering.
Achieving  global  consistency  can  be  reduced  to  the  problem  of  guaranteeing 
that  all  processes  choose  the  same  total  ordering  of  events,  thereby  having 
the  same  definition  of  the  global  system  state.  One  method  of  achieving  this 
common  total  ordering  is  through  the  use  of  logical  clocks  [Lam78].  A  logical 
clock  is  a  counter  maintained  by  each  process  with  the  property  that  if  event 
a precedes event b, then the time of event a precedes the time of event b,
where the time of an event is measured on the logical clock of the process
at which the event occurred. Logical clocks are implemented by attaching
a  timestamp,  containing  the  current  value  of  the  sender’s  logical  clock,  to 
each  message. 
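A minimal sketch of a logical clock may help fix the idea. The text states only that a timestamp is attached to each message; the receive rule used below (advance past the larger of the local clock and the timestamp) and the tie-break by process identifier follow the usual presentation of [Lam78]. The names `LogicalClock` and `total_order_key` are hypothetical and chosen for this example.

```python
class LogicalClock:
    """Counter maintained by one process (a sketch, not a full protocol)."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the event's time."""
        self.time += 1
        return self.time

    def send(self):
        """A send is an event; its time is the timestamp attached to the message."""
        return self.tick()

    def receive(self, timestamp):
        """Advance past both the local clock and the sender's timestamp."""
        self.time = max(self.time, timestamp) + 1
        return self.time


def total_order_key(time, pid):
    """Extend the partial order to a total order: compare times, then process ids."""
    return (time, pid)
```

If every process sorts events by `total_order_key`, all of them obtain the same total ordering, and that ordering is consistent with the causality relation because a receive is always timed later than the corresponding send.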

Because there is no unique definition of a global state in a message-passing
model, it is sometimes mistakenly argued that one should not use
the  global  state  in  reasoning  about  such  models.  The  absence  of  a  unique 
definition  of  the  global  state  does  not  mean  that  we  cannot  reason  in  terms 
of  an  arbitrarily  chosen  definition.  The  method  of  reasoning  we  describe 
below,  which  involves  reasoning  about  the  state  of  a  system,  is  useful  for  all 
concurrent  models,  including  message-passing  ones. 

Another  way  of  viewing  the  global  consistency  problem  is  in  terms  of 
knowledge.  The  problem  exists  because  it  is  impossible  for  a  process  to 
know  the  current  global  state,  since  the  concurrent  activity  of  other  pro¬ 
cesses  can  render  its  knowledge  obsolete.  It  is  rather  natural  to  think  about 
distributed  algorithms  in  terms  of  what  each  process  knows,  and  reasoning 
about  the  limitations  on  a  process’s  knowledge  forms  the  basis  for  proofs  of 
many  of  the  impossibility  results  described  below.  However,  only  recently 
has  there  been  an  attempt  to  perform  this  reasoning  within  formal  theories 
of knowledge [HM84]. These theories of knowledge provide a promising
approach  to  a  fundamental  theory  of  distributed  processing,  but,  at  this 
writing,  it  is  too  early  to  know  how  successful  they  will  prove  to  be. 

3  Reasoning  About  Distributed  Algorithms 

Concurrent  algorithms  can  be  deceptive;  an  algorithm  that  looks  simple 
may  be  quite  complex,  allowing  unanticipated  behavior.  Rigorous  reasoning 
is  necessary  to  determine  if  an  algorithm  does  what  it  is  supposed  to,  and 
rigorous  reasoning  requires  a  formal  foundation. 

Here,  we  discuss  verification — proving  properties  of  concurrent  algo¬ 
rithms.  In  verification,  the  properties  to  be  proved  are  stated  in  terms  of  the 
algorithm  itself — that  is,  in  terms  of  the  algorithm’s  variables  and  actions. 
The  related  field  of  specification,  in  which  the  properties  to  be  satisfied  are 
expressed  in  higher-level,  implementation-independent  terms,  is  considered 
briefly  in  Section  3.6.  Specification  methods  must  deal  with  the  subtle  ques¬ 
tion  of  what  it  means  for  a  lower-level  algorithm  to  implement  a  higher-level 
description.  This  question  does  not  arise  in  the  verification  methods  that 
we discuss, since the description of the algorithm and the properties to be
proved  are  expressed  in  terms  of  the  same  objects. 

3.1  A  System  as  a  Set  of  Behaviors 

We have already seen that there is a wide variety of computational models
of  concurrent  systems.  However,  they  can  almost  all  be  described  in  terms  of 
a  single  formal  model,  which  forms  the  basis  for  our  discussion  of  verification. 
In  this  model,  we  represent  a  concurrent  system  by  a  triple  consisting  of  a 
set  S  of  states,  a  set  A  of  actions,  and  a  set  E  of  behaviors,  each  behavior 
being  a  finite  or  infinite  sequence  of  the  form 

$$s_0 \xrightarrow{a_1} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} \cdots \qquad (1)$$

where each $s_i$ is a state and each $a_i$ is an action. (If the sequence is finite,
then it ends with a state $s_n$.) A state describes the complete instantaneous
state of the system, an action is a system operation that is taken to be
indivisible, and a behavior represents an execution of the system whose
action $a_i$ takes the system from state $s_{i-1}$ to state $s_i$. The set E represents
the set of all possible system executions.

Most  verification  methods  regard  a  behavior  as  either  a  sequence  of  states 
or  a  sequence  of  actions.  Having  states  and  actions  in  a  behavior  allows  our 
discussion  to  apply  to  both  approaches. 

To  reason  about  a  system,  one  must  first  describe  the  triple  S,  A,  E  that 
represents  it — for  example,  by  a  program  in  some  programming  language. 
Properties  of  the  system  are  expressed  by  assertions  about  the  set  E.  Here 
are three examples to indicate, very informally, how this is done.

mutual exclusion: For every state $s_i$ of every behavior of E, there is
at most one process in its critical section in $s_i$. (For a state to be a complete
description  of  the  instantaneous  state  of  the  system,  it  must  describe 
which  processes  are  in  their  critical  section.) 

lockout-freedom:  (This  property  asserts  that  a  process  that  wants  to  enter 
its  critical  section  eventually  does  so.)  For  every  behavior  of  the  form 
(1) in E and every $i \ge 0$, if $s_i$ is a state in which a process is requesting
entry to its critical section, then there is some $j > i$ such that $s_j$ is a
state  in  which  that  process  is  in  its  critical  section. 

bounded message delay: If $a_i$ is the action of sending a message, $s_{i-1}$ is
a state in which the time is T, and $s_j$ is a state in which the time is
greater than $T + \delta$, then there is a k with $i < k < j$ such that $a_k$ is
the action of receiving that message. (This assumes that the current
time is part of the state.)
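The informal properties above can be phrased as checks on a finite behavior. The sketch below is illustrative only and rests on several assumptions not found in the text: a behavior is stored as a list `states` of dictionaries with keys `"time"` and `"in_cs"` (the set of processes in their critical sections) and a list `actions` of `(kind, message)` pairs, with `actions[i - 1]` playing the role of $a_i$. Lockout-freedom is omitted because it cannot be decided from a finite prefix of an infinite behavior.

```python
def mutual_exclusion(states):
    """True if at most one process is in its critical section in every state."""
    return all(len(s["in_cs"]) <= 1 for s in states)


def bounded_message_delay(states, actions, delta):
    """Check the bounded message delay property on a finite behavior.

    actions[i - 1] is the action a_i taking states[i - 1] to states[i].
    """
    for i in range(1, len(states)):
        kind, msg = actions[i - 1]
        if kind != "send":
            continue
        sent_at = states[i - 1]["time"]      # the time T in state s_{i-1}
        for j in range(len(states)):
            if states[j]["time"] > sent_at + delta:
                # Need some k with i < k < j such that a_k receives the message.
                received = any(
                    actions[k - 1] == ("recv", msg) for k in range(i + 1, j)
                )
                if not received:
                    return False
    return True
```

Both functions examine one behavior; a system satisfies the property only if the check succeeds for every behavior in E.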

3.2  Safety  and  Liveness 

Any  model  is  an  abstraction  that  represents  only  some  aspects  of  the  system, 
and  the  choice  of  model  restricts  the  class  of  properties  one  can  reason  about. 
Most  formal  reasoning  about  concurrent  systems  has  been  aimed  at  proving 
two  kinds  of  properties:  safety  and  liveness.  Intuitively,  a  safety  property 
asserts  that  something  bad  does  not  happen,  and  a  liveness  property  asserts 
that  something  good  eventually  does  happen. 

In  sequential  computing,  the  most  commonly  studied  safety  property  is 
partial  correctness — if  the  program  is  started  with  correct  input,  then  it 
does  not  terminate  with  the  wrong  answer,  and  the  most  commonly  studied 
liveness  property  is  termination — the  program  eventually  terminates.  A 
richer  variety  of  safety  and  liveness  properties  are  studied  in  concurrent 
computing;  for  example,  mutual  exclusion  and  bounded  message  delay  are 
safety  properties  and  lockout-freedom  is  a  liveness  property. 

There are other classes of properties besides safety and liveness that are
of interest; for example, the assertion that there is a 0.99 probability that the
transmission delay is less than $\delta$ is neither a safety nor a liveness property.
However,  safety  and  liveness  are  the  major  classes  of  properties  for  which 
there  are  well  developed  methods  of  formal  reasoning,  so  we  will  restrict  our 
attention  to  them. 

A  safety  or  liveness  property  is  an  assertion  about  an  individual  behavior. 
It  is  satisfied  by  the  system  if  it  is  true  for  all  behaviors  in  E.  A  safety 
property  is  one  that  is  false  for  a  behavior  if  and  only  if  it  is  false  for 
some  finite  initial  prefix  of  the  behavior.  (Intuitively,  if  something  bad 
happens,  then  it  happens  after  some  finite  number  of  actions.)  A  liveness 
property is one for which any finite behavior can be extended to a finite or
infinite behavior (not necessarily a behavior of the program) that satisfies
the property [AS85]. (Intuitively, after any finite portion of the behavior, it
must  still  be  possible  for  a  good  thing  to  happen.) 
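For the common special case of a safety property asserting that every state satisfies some predicate, the characterization above says that any violation is witnessed by a finite prefix. The following sketch returns the shortest such prefix. The function name `shortest_bad_prefix` and the representation of a behavior as a list of states are assumptions made for this illustration, and the sketch does not cover safety properties of other forms.

```python
def shortest_bad_prefix(states, ok):
    """Return the shortest prefix of `states` ending in a state where `ok` fails.

    Returns None if every state satisfies `ok`, in which case no finite
    prefix of the given states violates the property.
    """
    for n, s in enumerate(states, start=1):
        if not ok(s):
            return states[:n]
    return None
```

No such finite witness exists for a liveness property: a prefix in which the good thing has not yet happened can always be extended to a behavior in which it does.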

3.3  Describing  a  System 

To  give  a  formal  description  of  a  system,  one  must  define  the  sets  of  states  S, 
actions  A,  and  behaviors  E.  A  state  is  defined  to  be  an  assignment  of  values 
to some set of variables, where the variables may include ordinary program
variables, message buffers, "program counters", and whatever else is needed
to describe completely the instantaneous state of the computation. The set
of actions is usually explicitly enumerated; for example, it may include all
actions of the form i sends m to j for particular processes i and j and a
particular message m. Actions represent internal operations of the system
as well as input and output operations.

There are two general approaches to describing the set E. They may be
called the constructive and axiomatic approaches, though we shall see that
these names are misleading. In the constructive approach, one describes
E by a program, where E is defined to be the set of all possible behaviors
obtained by executing the program. The program may be written in a
conventional programming language, or in terms of a formal model such
as I/O automata [LT87] or Unity [CM88]. In the axiomatic approach, one
describes E by a set of axioms, where E is defined to be the set of all
sequences of the form of formula (1) that satisfy the axioms. The axioms may
be written in a formal system (some form of temporal logic [Erne, Pnu77]
being a currently popular choice) or in a less formal mathematical notation.

Axiomatic descriptions lead directly to a method of reasoning. If $\mathcal{S}$ is
the set of axioms that describe E, and C is a property expressed in the
same formal system as $\mathcal{S}$, then the system satisfies C if and only if the
formula $\mathcal{S} \vdash C$ is valid. On the other hand, constructive descriptions are
often more convenient than axiomatic ones, since programming languages
are designed especially for describing computations while formal systems are
usually chosen for their logical properties.

In a constructive description, one specifies the possible state transitions
$s \xrightarrow{a} t$ caused by each action a of A. A behavior of the form (1) is in E
only if: (i) each transition $s_{i-1} \xrightarrow{a_i} s_i$ is a possible state transition of $a_i$, and
(ii) it is either infinite or it terminates in a state in which no further action
is possible.

Formally, one defines a relation r(a) on S for each action a of A, where
$(s,t) \in r(a)$ if and only if executing the action a in state s can produce
state t. The action a is said to be enabled in state s if there exists some
state t with $(s,t) \in r(a)$. For example, the operation send m to j in process
i's code is represented by the action a such that (s,t) is in r(a) if and only
if s is a state in which control in process i is at operation a and t is the
same as s except with m added to the queue of messages from i to j and
with control in process i at the next operation after a; this action is enabled
if and only if control is at the operation and the message queue is not full.
The behavior (1) is in E only if: (i) $(s_{i-1}, s_i) \in r(a_i)$ for all i, and (ii) the
sequence is either infinite or ends in a state $s_n$ in which no action is enabled.
Observe that condition (i) is a safety property.
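The relation r(a) for the send operation can be written down directly. The sketch below is an illustration under assumptions of its own: the state is reduced to a control point of process i and the queue of messages from i to j, the queue bound `CAPACITY` is a hypothetical value, and each action is represented by a function that maps a state s to the set of states t with (s,t) in r(a).

```python
from typing import NamedTuple, Tuple

CAPACITY = 2  # hypothetical bound on the message queue from i to j


class State(NamedTuple):
    pc: str                  # control point of process i: "send" or "next"
    queue: Tuple[str, ...]   # messages in transit from i to j


def r_send(s, m="m"):
    """Return the set of states t with (s, t) in r(a) for a = 'send m to j'."""
    if s.pc == "send" and len(s.queue) < CAPACITY:
        return {State("next", s.queue + (m,))}
    return set()


def enabled(r, s):
    """An action is enabled in s if some t satisfies (s, t) in r(a)."""
    return bool(r(s))


def satisfies_condition_i(states, actions):
    """Condition (i): (s_{i-1}, s_i) is in r(a_i) for every i."""
    return all(
        states[i] in actions[i - 1](states[i - 1]) for i in range(1, len(states))
    )
```

Representing r(a) as a successor function makes the enabling condition fall out of the definition: the send action is enabled exactly when control is at the operation and the queue is not full.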

In  this  definition,  we  include  in  E  behaviors  that  start  in  any  arbitrary 
state,  including  intermediate  states  one  expects  to  encounter  only  in  the 
middle  of  a  computation  and  states  that  cannot  occur  in  any  computation. 
The properties one proves are of the form: if a behavior starts in a certain
initial state, then .... For example, the set E for a mutual exclusion
algorithm includes behaviors starting with several processes in their critical
section.  It  is  customary  to  include  in  the  description  a  set  of  valid  initial 
states,  and  to  include  in  E  only  those  behaviors  starting  in  such  a  state. 
However,  we  find  it  more  convenient  not  to  assume  any  preferred  starting 
states  because,  as  we  shall  see,  when  proving  liveness  properties  one  must 
reason  about  the  system’s  behavior  starting  from  a  point  in  the  middle  of 
the  computation. 

In  addition  to  satisfying  the  two  conditions  above,  sequences  in  E  are 
usually  required  to  satisfy  some  kind  of  fairness  condition.  For  example,  one 
may require that the sequence contain infinitely many actions from every
process unless a point is reached after which no further actions of the process
are enabled. This condition is expressed more formally by requiring that for
every process k, either infinitely many of the $a_i$ are actions of k or else there
is some n such that no action of k is enabled in any state $s_i$ with $i > n$.
Fairness  conditions  are  liveness  properties. 
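Fairness concerns infinite behaviors, so it cannot be checked on an arbitrary finite prefix. It can, however, be checked on an infinite behavior that consists of a finite prefix followed by a cycle repeated forever. The sketch below assumes that representation; the function name `is_fair`, the encoding of an action as a `(process, name)` pair, and the `enabled(k, state)` callback are hypothetical choices for this example.

```python
def is_fair(cycle, processes, enabled):
    """Check the fairness condition on a behavior that repeats `cycle` forever.

    cycle     -- list of (state, action) pairs; action is (process, name)
    processes -- iterable of process identifiers
    enabled   -- enabled(k, state) is True if some action of k is enabled
    """
    for k in processes:
        # Infinitely many actions of k occur iff the cycle contains one.
        acts_forever = any(action[0] == k for _, action in cycle)
        # Some point exists after which k is never enabled iff k is
        # enabled in no state of the cycle.
        never_enabled = not any(enabled(k, state) for state, _ in cycle)
        if not (acts_forever or never_enabled):
            return False
    return True
```

The prefix before the cycle does not appear in the check, which reflects the fact that fairness restricts only the infinite part of a behavior.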

In practice, fairness conditions do not affect the safety properties of a
system.  This  means  that  if  all  behaviors  in  the  set  E  described  by  a  program 
satisfy  a  safety  property  C,  then  all  behaviors  satisfying  only  condition  (i), 
with  no  fairness  requirement,  also  satisfy  C.  Intuitively,  safety  properties 
are  assertions  about  any  arbitrarily  long  finite  portion  of  the  behavior,  while 
liveness  properties  restrict  only  the  infinite  behavior.  One  can  easily  devise 
fairness  conditions  that  affect  safety  properties — for  example,  the  fairness 
requirement  that  every  process  executes  infinitely  many  actions  implies  the 
safety  property  that  no  process  ever  reaches  a  halting  state.  However,  such 
fairness  conditions  are  unnatural  and  are  never  assumed  in  practice. 

For  reasoning  about  safety  properties,  one  can  therefore  ignore  condition 
(ii)  and  fairness  conditions  and  consider  only  the  relations  r(a)  defined  by 
the  actions.  (In  fact,  (ii)  really  is  a  fairness  condition.)  Conversely,  formal 
models  that  do  not  include  fairness  conditions  are  suitable  only  for  studying 
safety  properties,  not  liveness  properties. 

One  can  express  conditions  (i)  and  (ii)  and  the  fairness  conditions  in  a 
suitably chosen formal system. Expressing them in this way provides an
axiomatic semantics for constructive descriptions, meaning that every
description in the form of a program can be translated into a collection of
axioms. Thus, constructive descriptions can be viewed as a special class of
axiomatic ones. In particular, we can adopt the simple approach to formal
reasoning in which a program satisfies a property C if and only if $\mathcal{S} \vdash C$ is
valid, where $\mathcal{S}$ is the translation of the program as a set of axioms. While
this  approach  provides  a  formal  definition  of  what  it  means  for  a  program 
to  satisfy  a  property,  it  does  not  necessarily  provide  a  practical  method  for 
reasoning  about  programs  because  the  axioms  derived  from  conditions  (i) 
and  (ii)  may  be  too  complicated. 

3.4  Assertional  Reasoning 

Verifying  that  a  system  satisfies  a  property  C  means  showing  that  every 
behavior  satisfying  the  definition  of  system  behaviors  also  satisfies  C.  The 
obvious  way  of  doing  this  is  to  reason  directly  about  sequences,  using  either 
a  temporal  logic  or  direct  mathematical  reasoning  about  sequences.  The 
problem  with  such  an  approach  is  that  concurrent  systems  can  exhibit  a 
wide  variety  of  possible  behaviors.  Reasoning  directly  about  behaviors  can 
become quite complex, with many different cases to consider. It is not clear
if  there  are  satisfactory  methods  for  coping  with  this  complexity. 

Assertional  methods  attempt  to  overcome  this  difficulty  by  reducing 
the  problem  of  reasoning  about  concurrent  systems  to  that  of  reasoning 
separately  about  each  individual  action.  In  an  assertional  method,  attention 
is  concentrated  on  the  states.  A  behavior  is  considered  to  be  a  (finite  or 
infinite) sequence of states $s_0 \to s_1 \to s_2 \to \cdots$ and properties are expressed
in  terms  of  state  predicates — boolean-valued  functions  on  the  set  of  states. 
Safety  and  liveness  properties  are  handled  by  separate  techniques. 

3.4.1  Simple  Safety  Properties 

It  is  convenient  to  introduce  a  bit  of  temporal  logic  to  express  properties. 
We  interpret  a  state  predicate  P  as  an  assertion  about  behaviors  by  defining 
P  to  be  true  for  a  behavior  if  and  only  if  it  is  true  for  the  first  state  of  the 
behavior. We define $\Box P$ to be the assertion that is true for a behavior if
and only if P is true for all states in the behavior, so $\Box P$ asserts that P is
“always”  true. 

Traditional assertional methods prove safety properties of the form
$P \Rightarrow \Box Q$ for state predicates P and Q. Most safety properties that have been
considered  are  of  this  form,  with  P  being  the  predicate  asserting  that  pro¬ 
gram  control  is  at  the  beginning  and  all  program  variables  have  their  correct 
initial  values.  For  example,  partial  correctness  is  expressed  by  letting  Q  be 
the  predicate  asserting  that  if  control  is  at  the  end  then  the  variables  have 
the  correct  final  values,  and  mutual  exclusion  is  expressed  by  letting  Q  be 
the  predicate  asserting  that  no  two  processes  are  in  their  critical  sections. 
Proving  such  a  property  means  showing  that  a  certain  class  of  states  in  S, 
namely the states in which Q is false, do not appear in any behaviors in E
that begin in a state with P true.

We say that a state predicate I is an invariant of a system if no action
in A can make I false. More formally, I is an invariant if and only if for
every action a in A and every pair (s,t) in r(a): if I(s) is true then I(t)
is true. A simple induction argument shows that if I is an invariant then
$I \Rightarrow \Box I$ is true for every behavior in E. (What we call an invariant is also
called a stable property, and the term "invariant" is often used to mean a
stable property that is true of the initial state.)

In assertional methods, one proves $P \Rightarrow \Box Q$ by finding a predicate I
such that (i) I is an invariant, (ii) P implies I, and (iii) I implies Q. Since
the invariance of I means that $I \Rightarrow \Box I$ is true for every sequence in E, it
follows easily from (ii) and (iii) that $P \Rightarrow \Box Q$ is true for every sequence in
E.

As  a  simple  example,  consider  the  two-process  program  in  Figure  1, 
where  each  process  cycles  repeatedly  through  a  loop  composed  of  three  state¬ 
ments,  the  angle  brackets  enclosing  atomic  actions.  This  program  describes 
a  common  hardware  synchronization  protocol  that  ensures  that  the  two  pro¬ 
cesses  alternately  execute  their  critical  sections.  (For  simplicity  the  critical 
sections  are  represented  by  atomic  actions.)  We  prove  that  this  algorithm 
guarantees  mutual  exclusion,  which  means  that  control  is  not  at  the  critical 
section statements in both processes at the same time.¹ Mutual exclusion
is expressed formally as the requirement $P \Rightarrow \Box Q$, where the predicates P
and Q are defined by


$$P \equiv at(\alpha) \land at(\lambda)$$

$$Q \equiv \neg(at(\beta) \land at(\mu)).$$

Here $at(\alpha)$ is the predicate asserting that control in the first process is at
statement $\alpha$, and the other "at" predicates are similarly defined.

¹This protocol does not solve the original mutual exclusion problem because one process
cannot progress if the other halts.

variables x, y : boolean;

cobegin  loop  α: ⟨ await x = y ⟩;
               β: ⟨ critical section ⟩;
               γ: ⟨ x := ¬x ⟩
         end loop
   □
         loop  λ: ⟨ await y ≠ x ⟩;
               μ: ⟨ critical section ⟩;
               ν: ⟨ y := ¬y ⟩
         end loop
coend

Figure 1: A simple synchronization protocol.

The invariant I used to prove this property is defined by

$$I \equiv ((at(\beta) \lor at(\gamma)) \Rightarrow (x = y)) \land ((at(\mu) \lor at(\nu)) \Rightarrow (x \ne y))$$

If the critical sections do not change x or y, then executing any atomic action
of the program starting with I true leaves I true, so I is an invariant. It is
also easy to check that $P \Rightarrow I$ and $I \Rightarrow Q$, which imply $P \Rightarrow \Box Q$.
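Because the protocol of Figure 1 has only finitely many states, the three proof obligations can be checked mechanically by enumeration. The sketch below is one possible encoding, not the method of the text: a state is a tuple `(pc1, pc2, x, y)`, the control points are named by spelling out the statement labels, and `successors` lists the transitions allowed by the relations r(a) of the six atomic actions.

```python
from itertools import product

PC1 = ("alpha", "beta", "gamma")   # control points of the first process
PC2 = ("lambda", "mu", "nu")       # control points of the second process
STATES = list(product(PC1, PC2, (False, True), (False, True)))


def successors(s):
    """All states reachable from s by executing one atomic action."""
    pc1, pc2, x, y = s
    out = []
    if pc1 == "alpha" and x == y:      # await x = y
        out.append(("beta", pc2, x, y))
    if pc1 == "beta":                  # critical section (leaves x, y unchanged)
        out.append(("gamma", pc2, x, y))
    if pc1 == "gamma":                 # x := not x
        out.append(("alpha", pc2, not x, y))
    if pc2 == "lambda" and y != x:     # await y != x
        out.append((pc1, "mu", x, y))
    if pc2 == "mu":                    # critical section
        out.append((pc1, "nu", x, y))
    if pc2 == "nu":                    # y := not y
        out.append((pc1, "lambda", x, not y))
    return out


def P(s):
    return s[0] == "alpha" and s[1] == "lambda"


def Q(s):
    return not (s[0] == "beta" and s[1] == "mu")


def I(s):
    pc1, pc2, x, y = s
    first = (pc1 not in ("beta", "gamma")) or (x == y)
    second = (pc2 not in ("mu", "nu")) or (x != y)
    return first and second


def is_invariant(pred):
    """No action can make pred false: pred(s) implies pred(t) for every step."""
    return all(pred(t) for s in STATES if pred(s) for t in successors(s))
```

Under this encoding, `is_invariant(I)` holds while `is_invariant(Q)` does not: Q is true in every state reachable from P, but it is not preserved from arbitrary states, which is why the stronger predicate I is needed.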

The method of proving safety properties of the form $P \Rightarrow \Box Q$ can be
generalized to prove properties of the form $P \land \Box R \Rightarrow \Box Q$ for predicates P,
Q, and R. Such properties are used in proving liveness properties. We say
that a predicate I is invariant under the constraint R if any action executed
in a state with $I \land R$ true leaves I true or makes R false. If I is invariant
under the constraint R, then $I \land \Box R \Rightarrow \Box I$ is true for every behavior in
E. One can therefore prove $P \land \Box R \Rightarrow \Box Q$ by finding a predicate I such
that (i) I is an invariant under the constraint R, (ii) P implies I, and (iii) I
implies Q. Thus, the ordinary assertional method for proving $P \Rightarrow \Box Q$
is extended to prove properties of the form $P \land \Box R \Rightarrow \Box Q$ by replacing
invariance with invariance under the constraint R.

The hard part of an assertional proof is constructing I and verifying that
it is an invariant (or an invariant under a constraint). The predicate I can
be quite complicated, and finding it can be difficult. However, proving that
it is an invariant is reduced to reasoning separately about each individual
action.
Experience has indicated that this reduction is usually simpler and more
illuminating than reasoning directly about the behaviors for proving safety
properties that are easily expressed in the form $P \Rightarrow \Box Q$. However, reasoning
about behaviors has been more successful for proving properties that are
not easily expressed in this form. It is usually the case that safety properties
one proves about a particular algorithm are of the form $P \Rightarrow \Box Q$, while
general properties one proves about classes of algorithms are not.

Because  the  invariant  I  can  be  complicated,  one  wants  to  decompose 
it  and  further  decompose  the  proof  of  its  invariance.  This  is  done  by  the 
Owicki-Gries method [OG76], in which the invariant is written as a program
annotation with predicates attached to program control points. In
this method, I is the conjunction of predicates of the form "If program
control is at this point, then the attached predicate is true." The decomposition
of the invariance proof is based upon the following principle: if I
is an invariant and I' is invariant under the constraint I, then $I \land I'$ is an
invariant.

A number of variations of the Owicki-Gries method have been proposed,
usually for the purpose of handling particular styles of interprocess
communication [AFdR80, LG81]. These methods are usually described
in terms of proof rules (the individual steps one goes through in proving
invariance) without explicitly mentioning I or the underlying concept of
invariance.  This  has  tended  to  obscure  their  simple  common  foundation. 

3.4.2  Liveness  Properties 

If P and Q are predicates, then $P \leadsto Q$ is defined to be true if, whenever a
state is reached in which P is true, then eventually a state will be reached
in which Q is true. More precisely, $P \leadsto Q$ is true for the sequence (1) if
for every n, if $P(s_n)$ is true then there exists an $m > n$ such that $Q(s_m)$ is
true. Most liveness properties that one wishes to prove about systems are
expressible in the form $P \leadsto Q$. For example, termination is expressed by
letting  P  assert  that  the  program  is  in  its  starting  state  and  letting  Q  assert 
that  the  program  has  terminated;  lockout-freedom  is  expressed  by  letting 
P  assert  that  some  process  k  is  requesting  entry  to  its  critical  section  and 
letting  Q  assert  that  k  is  in  its  critical  section. 

The basic method of proving liveness properties is by a counting argument,
using a well-founded set: one with a partial ordering relation $\succ$ such
that there are no infinite chains of the form $c_1 \succ c_2 \succ \cdots$. Suppose we
construct a function w from the set of states to a well-founded set with the
following property: if the system is in a state s in which Q(s) is false, then it
must eventually reach a state t in which either Q(t) is true or $w(s) \succ w(t)$.
Since the value of w cannot decrease forever, this implies that Q must eventually
become true.

To prove $P \leadsto Q$, we construct such a function w and prove that it has
the required property, namely, that its value must keep decreasing unless
Q becomes true. In this proof, we may assume the truth of any predicate R
such that $P \Rightarrow \Box R$ is true for all behaviors in E. This is a generalization of
the usual method for proving termination of a loop in a sequential program,
in which w decreases with each iteration of the loop and R asserts that the
loop invariant² is true if control is at the start of the loop.

One still needs some way of proving that w must decrease unless Q
becomes true, assuming the truth of a predicate R that satisfies $P \Rightarrow \Box R$.
The simplest approach is to prove that each action in A either decreases
the value of w or else makes Q true; in other words, that for every action
a and every $(s,t) \in r(a)$: $R(s) \land \neg Q(s)$ implies $w(s) \succ w(t) \lor Q(t) \lor \neg R(t)$.
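The per-action condition just stated can be checked by enumeration on a small example. The system below is invented for this illustration and does not appear in the text: a state is a pair of counters (i, j), one action decrements j, and another decrements i and resets j to a hypothetical bound `M` when j is zero. The function w is the identity, with pairs ordered lexicographically, which is a well-founded ordering on pairs of natural numbers.

```python
from itertools import product

M = 3  # hypothetical bound on both counters
STATES = list(product(range(M + 1), range(M + 1)))


def successors(s):
    """States reachable from s = (i, j) by one action."""
    i, j = s
    out = []
    if j > 0:                 # inner step: decrement j
        out.append((i, j - 1))
    if j == 0 and i > 0:      # outer step: decrement i and reset j
        out.append((i - 1, M))
    return out


def Q(s):
    return s == (0, 0)


def R(s):
    return True               # no auxiliary safety property is needed here


def w(s):
    return s                  # Python compares tuples lexicographically


def decreases_or_done(states, successors, w, Q, R):
    """R(s) and not Q(s) imply w(s) > w(t) or Q(t) or not R(t), for every step."""
    return all(
        w(s) > w(t) or Q(t) or not R(t)
        for s in states
        if R(s) and not Q(s)
        for t in successors(s)
    )
```

The choice of w matters: taking w(s) to be j alone fails on the outer step, where j increases. To conclude that Q is eventually reached one also needs some action to be enabled in every state where Q is false, which holds in this example.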

This approach works only if the validity of the property $P \leadsto Q$ does not
depend upon any fairness assumptions. To see how it can be generalized to
handle fairness, consider the simple fairness assumption that if an action is
continuously enabled, then it must eventually be executed; in other words,
for every behavior (1) and every $n > 0$: if a is enabled in all states $s_i$ with
$i > n$, then $a = a_i$ for some $i > n$. Under this assumption, it suffices to show
that every action either leaves the value of w unchanged or else decreases it,
and that there is at least one action a whose execution decreases w, where
a remains enabled until it is executed. Again, this need be proved only
under the assumption that Q remains false and R remains true, where R is
a predicate satisfying $P \Rightarrow \Box R$.

The  problem  with  this  approach  is  that  the  precise  rules  for  reasoning 
depend  upon  the  type  of  fairness  assumptions.  An  alternative  approach 
uses  the  single  framework  of  temporal  logic  to  reason  about  any  kind  of 
fairness conditions. We have already written the liveness property to be
proved ($P \leadsto Q$) and the safety properties used in its proof (properties of
the form $P \Rightarrow \Box R$) as temporal logic formulas. The fairness conditions are
also expressible as a collection of temporal logic formulas. Logically, all that
must be done is to prove, using the rules of temporal logic, that the fairness
conditions and the safety properties imply the desired liveness property. The
problem is to decompose this proof into a series of simple steps.

²A loop invariant is not an invariant according to our definition, since it asserts only
what must be true when control is at a certain point, saying nothing about what must be
true at the preceding control point.

The decomposition is based upon the following observation. Let $\mathcal{A}$ be a
well-founded set of predicates. Suppose that, using safety properties of the
form $P \Rightarrow \Box R$, for every predicate A in $\mathcal{A}$ we can prove that

$$A \leadsto (Q \lor \exists A' \in \mathcal{A} : A \succ A')$$

The well-foundedness of $\mathcal{A}$ then implies that Q must eventually become true.
This decomposition is indicated by a proof lattice³ consisting of Q and the
elements of $\mathcal{A}$ connected by lines, where downward lines from A to $A_1, \ldots, A_n$
denote the assertion $A \leadsto A_1 \lor \cdots \lor A_n$.

An argument using a proof lattice $\mathcal{A}$ of predicates is completely equivalent
to a counting argument using a function w with values in a well-founded
set; either type of argument is easily translated into the other. These counting
arguments work well for proving liveness properties that do not depend
upon fairness assumptions. When fairness is required, it is convenient to use
more general proof lattices containing arbitrary temporal logic formulas, not
just  predicates. 

To illustrate the use of such proof lattices, we consider the mutual
exclusion algorithm of Figure 2. For simplicity, the noncritical sections have
been eliminated and the critical sections are represented by atomic actions,
which are assumed not to modify x or y. Under the fairness assumption
that a continuously enabled action must eventually be executed, this
algorithm guarantees that the first process eventually enters its critical section.
(However, the second process might remain forever in its while loop.) The
proof that the algorithm satisfies the liveness property
$at(\alpha) \leadsto at(\gamma)$ uses the proof lattice of Figure 3. The individual
relations represented by the lattice are numbered and are explained below.

1. $at(\alpha) \leadsto (at(\beta) \land x)$ follows from the fairness assumption,
since action $\alpha$ is enabled when $at(\alpha)$ is true.

2. This is an instance of the temporal logic tautology

$$P \leadsto (Q \lor (P \land \Box \neg Q))$$

which is valid because $Q$ either eventually becomes true or else
remains forever false. (We are using linear-time temporal logic [Eme,
section 2.3].)

^The term “proof lattice” is used even though $\mathcal{A}$ need not be a lattice.

variables x, y : boolean;

cobegin  loop  α: ⟨ x := true ⟩;
               β: ⟨ await ¬y ⟩;
               γ: ⟨ critical section ⟩;
                  ⟨ x := false ⟩
         end loop
   □
         loop     ⟨ y := true ⟩;
                  while ⟨ x ⟩ do ⟨ y := false ⟩;
                              λ: ⟨ await ¬x ⟩;
                                 ⟨ y := true ⟩
                  od
                  ⟨ critical section ⟩;
                  ⟨ y := false ⟩
         end loop
coend

Figure 2: A simple mutual exclusion algorithm.

Figure 3: Proof lattice for mutual exclusion algorithm.

3. This $\leadsto$ relation is actually an implication, asserting that if the first
process is at statement $\beta$ with x true and never reaches $\gamma$, then it
must remain forever at $\beta$ with x true. This implication is of the form
$(P \land \Box R) \Rightarrow \Box Q$ and is proved by finding an invariant under the
constraint $R$, as explained in Section 3.4.1.

4. If x remains true forever, then the fairness assumption implies that
control in the second process must eventually reach $\lambda$ with y false. A
formal proof of this assertion would use another proof lattice in which
each relation represents a single step of the second process.

5. This is another property of the form $(P \land \Box R) \Rightarrow \Box Q$, proved by
finding an invariant under the constraint $R$.

6. Action $\beta$ is enabled when $at(\beta) \land \neg y$ holds, so by the fairness
assumption, $\Box(at(\beta) \land \neg y)$ implies that $\beta$ must eventually be
executed, making $at(\beta)$ false. Since $\Box\, at(\beta)$ asserts that $at(\beta)$ is
never false, this is a contradiction.

7.  false  implies  anything. 

The  proof  lattice  formalizes  a  simple  style  of  intuitive  reasoning.  Further 
examples  of  the  use  of  proof  lattices  can  be  found  in  [OL82]. 
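
For an algorithm as small as that of Figure 2, the liveness claim can also be checked by exhaustive search. The following Python sketch is an illustration added here, not part of the proof-lattice method: it enumerates the reachable states and looks for a set of states in which a weakly fair behavior could remain forever without reaching a goal. The state encoding, the label "reset", the numeric labels 0 to 6 for the second process, the initial values x = y = false, and all function names are assumptions made for this example.

```python
from collections import deque

# A state is (pc1, pc2, x, y). Process 1 uses the labels of Figure 2 plus
# "reset" for its last action; process 2 uses 0..6, with 3 standing for lambda.
# These names and the initial values x = y = False are assumptions.
INIT = ("alpha", 0, False, False)


def steps(state):
    """Yield (process, successor) for every atomic action enabled in state."""
    pc1, pc2, x, y = state
    if pc1 == "alpha":                      # <x := true>
        yield 1, ("beta", pc2, True, y)
    elif pc1 == "beta" and not y:           # <await not y>
        yield 1, ("gamma", pc2, x, y)
    elif pc1 == "gamma":                    # <critical section>
        yield 1, ("reset", pc2, x, y)
    elif pc1 == "reset":                    # <x := false>
        yield 1, ("alpha", pc2, False, y)
    if pc2 == 0:                            # <y := true>
        yield 2, (pc1, 1, x, True)
    elif pc2 == 1:                          # while <x>
        yield 2, (pc1, 2 if x else 5, x, y)
    elif pc2 == 2:                          # <y := false>
        yield 2, (pc1, 3, x, False)
    elif pc2 == 3 and not x:                # lambda: <await not x>
        yield 2, (pc1, 4, x, y)
    elif pc2 == 4:                          # <y := true>, then back to the test
        yield 2, (pc1, 1, x, True)
    elif pc2 == 5:                          # <critical section>
        yield 2, (pc1, 6, x, y)
    elif pc2 == 6:                          # <y := false>
        yield 2, (pc1, 0, x, False)


def reachable(start, allowed=lambda s: True):
    """Return the states reachable from start through allowed states."""
    seen, todo = {start}, deque([start])
    while todo:
        for _, nxt in steps(todo.popleft()):
            if allowed(nxt) and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def fair_trap(states, goal):
    """Return a set of non-goal states in which a weakly fair behavior can
    stay forever, or None if no such set exists."""
    region = {s for s in states if not goal(s)}
    reach = {s: reachable(s, region.__contains__) for s in region}
    for s in region:
        scc = {t for t in reach[s] if s in reach[t]}   # mutually reachable
        fair = all(
            # process p is disabled somewhere in the component ...
            any(all(q != p for q, _ in steps(t)) for t in scc)
            # ... or p takes a step that stays inside the component
            or any(q == p and n in scc for t in scc for q, n in steps(t))
            for p in (1, 2)
        )
        if fair:
            return scc
    return None


STATES = reachable(INIT)
mutex_ok = all(not (s[0] == "gamma" and s[1] == 5) for s in STATES)
trap1 = fair_trap(STATES, lambda s: s[0] == "gamma")   # can process 1 starve?
trap2 = fair_trap(STATES, lambda s: s[1] == 5)         # can process 2 starve?
print(mutex_ok, trap1 is None, trap2 is not None)      # True True True
```

Under these assumptions the search finds no trap for the first process, which agrees with the lattice proof, and it does find one for the second process, which agrees with the remark that the second process might remain forever in its while loop. Such a search works only for finite-state algorithms; the proof lattice does not have that restriction.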

Temporal logic appears to be the best method for proving liveness
properties that depend upon fairness assumptions. There seems little reason to
use  less  formal  methods  for  reasoning  about  behaviors,  since  such  reasoning 
can  be  expressed  compactly  and  precisely  with  temporal  logic.  However,  the 
verification of liveness properties has received less attention than the
verification of safety properties, and any conclusions we draw about the best
approach  to  verifying  liveness  properties  must  be  tentative. 

3.5  Deriving  Algorithms 

We  have  discussed  methods  for  reasoning  about  algorithms,  without  regard 
to how the algorithms are developed. There is increasing interest in
methods for deriving correct algorithms. Exactly what is meant by “deriving” an
algorithm  varies.  It  may  consist  of  simply  developing  the  correctness  proof 
along with the algorithm. Such an approach, based upon assertional
methods and the Unity language, is taken by Chandy and Misra [CM88]. At the
other extreme are approaches in which the program is derived automatically
from a formal specification [Eme, section 7.3].

An appealing approach to the development of correct algorithms is by
program transformation. One starts with a simple algorithm whose
correctness is obvious, and transforms it by a series of refinement steps, where each
step yields an equivalent program. Perhaps the most elegant instance of this
approach is Milner’s Calculus of Communicating Systems (CCS) [Mil80],
where  refinement  steps  are  based  upon  simple  algebraic  laws.  However,  the 
simplicity  and  elegance  of  CCS  break  down  in  the  presence  of  fairness,  so 
CCS  is  not  well  suited  for  developing  algorithms  whose  correctness  depends 
upon  fairness. 

Methods  for  deriving  concurrent  algorithms  are  comparatively  new  and 
have  thus  far  had  only  limited  success.  Automatic  methods  can  derive  only 
simple,  finite-state  algorithms.  While  informal  methods  can  often  provide 
elegant  post  hoc  derivations  of  existing  algorithms,  it  is  not  clear  how  good 
they are at deriving new algorithms. Finding efficient algorithms, whether
efficiency is judged by theoretical complexity measures or by implementation
in a real system, is still an art rather than a science. We still need to verify
algorithms  independently  of  how  they  are  developed. 

3.6  Specification 

To  determine  whether  an  algorithm  is  correct,  we  need  a  precise  specification 
of  the  problem  it  purports  to  solve.  In  the  classical  theory  of  computation, 
a  problem  is  specified  by  describing  the  correct  output  as  a  function  of  the 
input.  Such  an  input /output  function  is  inadequate  for  specifying  a  problem 
in concurrency, which may involve a complex interaction of the system and
its  environment. 

As  discussed  above,  a  behavior  of  a  concurrent  system  is  usually  modeled 
as  a  sequence  of  states  and/or  actions.  A  specification  of  a  system — that 
is,  a  specification  of  what  the  system  is  supposed  to  do — consists  of  the 
set  of  all  behaviors  considered  to  be  correct.  Another  approach,  taken  by 
CCS [Mil80], is to model a concurrent system as a tree of possible actions,
where  branching  represents  nondeterminism.  The  specification  is  then  a 
single  tree  rather  than  a  set  of  sequences. 

With  any  specification  method,  there  arises  the  question  of  exactly  what 
it  means  for  a  particular  system  to  implement  a  specification.  This  is  a  very 
subtle  question.  Details  that  are  insignificant  for  sequential  programs  may 
determine  whether  or  not  it  is  even  possible  to  implement  a  specification  of 
a  concurrent  system.  Some  of  the  issues  that  must  be  addressed  are: 

• No system can function properly in the face of completely arbitrary
behavior by the environment. How can an implementation specify
appropriate constraints on the environment (for example, that the
environment not change the program’s local variables) without “illegally”
constraining the environment (for example, by preventing it from
generating any input)?

• The granularity of action of the specification is usually much coarser
than that of the implementation; for example, sending a message may
be a single specification action, while executing each computer
instruction is a separate implementation action. What does it mean to
implement a single specification action by a set of lower-level actions?

•  The  granularity  of  data  in  the  specification  may  be  coarser  than  in 
the  implementation — for  example,  messages  versus  computer  words. 
What  does  it  mean  to  implement  one  data  structure  with  another? 

Space  does  not  permit  a  description  of  proposed  specification  methods  and 
how they have addressed (or failed to address) these issues. We can only
refer the reader to a small selection from the extensive literature on
specification [LS84, Lam89, LT87, SM82].

4  Some  Typical  Distributed  Algorithms 

In this section, we discuss some of the most significant algorithms and
impossibility results in this area. We restrict our attention to four major categories
of results: shared variable algorithms, distributed consensus algorithms,
distributed network algorithms and concurrency control. Although we are
neglecting many interesting topics, these four areas provide a representative
picture  of  distributed  computing. 

In early work, algorithms were presented rather informally, without
formal models or rigorous correctness proofs. The lack of rigor led to errors,
including the publication of incorrect algorithms. The development of
formal models and proof techniques such as those discussed in Section 3, as well
as  a  generally  higher  standard  of  rigor,  has  made  such  errors  less  common. 
However,  algorithms  are  still  published  with  inadequate  correctness  proofs, 
and  synchronization  errors  are  still  a  major  cause  of  “crashes”  in  computer 
systems. 

4.1 Shared Variable Algorithms

Shared variable algorithms represent the beginnings of distributed
computing theory, and many of the ideas that are important elsewhere in the area
first appear here. Today, programming languages provide powerful
synchronization primitives and multiprocess computers provide special instructions
to simplify their implementation, so the early synchronization algorithms are
seldom  used.  However,  higher-level  contention  and  cooperation  problems 
still  exist,  and  these  early  algorithms  provide  insight  into  these  problems. 

4.1.1  Mutual  Exclusion 

The  prototypical  contention  problem  is  that  of  mutual  exclusion.  Dijkstra 
[Dij65]  presents  a  mutual  exclusion  algorithm  which  uses  indivisible  read 
and  write  operations  on  shared  variables.  In  addition  to  ensuring  mutual 
exclusion,  the  algorithm  ensures  the  liveness  property  that  some  process 
eventually  enters  its  critical  section  if  there  are  any  contending  processes. 
Lockout  freedom  is  not  guaranteed;  the  system  might  grant  the  resource 
repeatedly  to  the  same  process,  excluding  another  process  forever.  This 
algorithm  is  significant  because  prior  to  its  discovery,  it  was  not  even  clear 
that  the  problem  could  be  solved. 

Dijkstra’s  algorithm  inspired  a  succession  of  additional  solutions  to  the 
mutual  exclusion  problem.  Some  of  this  work  improves  upon  his  algorithm 
by  adding  the  requirement  that  the  solution  be  fair  to  individual  processes. 
Fairness  can  take  several  forms.  The  strongest  condition  usually  stated  is 
FIFO  (first-in  first-out),  while  the  weakest  is  lockout  freedom.  There  are 
intermediate  possibilities:  there  might  be  an  upper  bound  on  the  number 
of  times  one  process  can  be  bypassed  by  another  while  it  is  waiting  for 
the  resource  (“bounded  waiting”),  or,  the  time  for  a  process  to  obtain  the 
resource  might  be  bounded  in  terms  of  its  own  step  time.  (These  last  two 
conditions  are  very  different:  the  former  is  an  egalitarian  condition  which 
tends  to  cause  all  processes  to  move  at  the  same  speed,  while  the  latter  tends 
to  allow  faster  processes  to  move  ahead  of  slower  processes.)  The  work  on 
mutual  exclusion  includes  a  collection  of  algorithms  satisfying  these  various 
fairness  conditions. 

An  interesting  example  of  a  mutual  exclusion  algorithm  is  Lamport’s 
“bakery  algorithm”  [Lam74],  so  called  because  it  is  based  on  the  processes 
choosing  numbers,  much  as  customers  do  in  a  bakery.  The  bakery  algorithm 
was the first FIFO solution, and it was the first solution to use only local
shared variables (see Section 2.2.1). It also has the fault-tolerance property
that if a process stops during its protocol, and its local shared variables
subsequently revert to their initial values, then the rest of the system continues
correctly without it. This property permits a distributed implementation
that tolerates halting failures.

The  most  important  property  of  the  bakery  algorithm  is  that  it  was  the 
first  algorithm  to  implement  mutual  exclusion  without  assuming  lower-level 
mutual  exclusion  of  read  and  write  accesses  to  shared  variables.  Accesses 
to shared variables may occur concurrently, where reads that occur
concurrently with writes are permitted to return arbitrary values. Concurrent
reading  and  writing  is  discussed  in  Section  4.1.4. 
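
The structure of the bakery algorithm can be sketched in a few lines of Python. This follows the usual textbook presentation of the algorithm in [Lam74], not code from this survey; the names `choosing`, `number`, `run_bakery` and the bookkeeping in `stats` are assumptions made for this illustration. Each thread writes only its own entries of `choosing` and `number`, which is the "local shared variables" property mentioned above. The sketch relies on CPython's global interpreter lock, which makes individual list reads and writes behave indivisibly, so it does not exercise the algorithm's tolerance of reads that overlap writes.

```python
import threading
import time


def run_bakery(n=3, rounds=50):
    """Run n threads that each enter the critical section `rounds` times."""
    choosing = [False] * n      # choosing[i]: thread i is picking a number
    number = [0] * n            # number[i] == 0: thread i is not competing
    stats = {"counter": 0, "inside": 0, "max_inside": 0}

    def lock(i):
        choosing[i] = True
        number[i] = 1 + max(number)          # take a ticket
        choosing[i] = False
        for j in range(n):
            while choosing[j]:               # wait until j has its ticket
                time.sleep(0)
            # The lower ticket goes first; ties are broken by thread index.
            while number[j] != 0 and (number[j], j) < (number[i], i):
                time.sleep(0)

    def unlock(i):
        number[i] = 0

    def worker(i):
        for _ in range(rounds):
            lock(i)
            stats["inside"] += 1
            stats["max_inside"] = max(stats["max_inside"], stats["inside"])
            stats["counter"] += 1            # unprotected read-modify-write
            stats["inside"] -= 1
            unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["counter"], stats["max_inside"]
```

Calling `run_bakery(3, 50)` should return a counter of 150 and a maximum occupancy of 1, since no two threads are ever inside the critical section together. The calls to `time.sleep(0)` only yield the processor while spinning and are not part of the algorithm.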

Peterson  and  Fischer  [PF77]  contribute  a  complexity-theory  perspective 
to  the  mutual  exclusion  area.  They  describe  a  collection  of  algorithms  which 
include  strong  fairness  and  resiliency  properties,  and  which  also  keep  the  size 
of the shared variables small. Of particular interest is their “tournament
algorithm”, which builds an n-process mutual exclusion algorithm from a
binary tree of 2-process mutual exclusion algorithms. They also describe a
useful way to prove bounds on time complexity for asynchronous parallel
algorithms: assuming upper bounds on the time for certain primitive
occurrences (such as process step time and time during which a process holds the
resource), they infer upper bounds on the time for occurrences of interest
(such as the time for a requesting process to obtain the resource). Their
method can be used to obtain reasonable complexity bounds, not only for
mutual exclusion algorithms, but also for most other types of asynchronous
algorithms.

The development of many different fairness and resiliency conditions,
and of many complex algorithms, gave rise to the need for rigorous ways
of reasoning about them. Burns et al. [BJL+82] introduce formal models
for shared-variable algorithms, and use the models not only to describe
new memory-efficient algorithms, but also to prove impossibility results and
complexity lower bounds. The upper and lower bound results in [BJL+82]
are for the amount of shared memory required to achieve mutual exclusion
with various fairness properties. The particular model assumed there allows
for a powerful sort of access to shared memory, via indivisible “test and set”
(combined read and write) operations. Even so, Burns and his coauthors are
able to prove that $\Omega(n)$ different values of shared memory are required to
guarantee fair mutual exclusion. More precisely, guaranteeing freedom from
lockout requires at least n/2 values, while guaranteeing bounded waiting
requires at least n values.

The lower bound proofs in [BJL+82] are based on the limitations of “local
knowledge” in a distributed system. Since processes’ actions depend only
on their local knowledge, processes must act in the same way in all
computations that look identical to them. The proofs assume that the shared
memory has fewer values than the claimed minimum and derive a
contradiction. They do this by describing a collection of related computations
and then using the limitation on shared memory size and the pigeonhole
principle to conclude that some of these computations must look identical
to certain processes. But among these computations are some for which the
problem specification requires the processes to act in different ways,
yielding a contradiction. The method used here, proving that actions based on
local knowledge can force two processes to act the same when they should
act differently, is the fundamental method for deriving lower bounds and
other impossibility results for distributed algorithms.

The lower bound results in [BJL+82] apply only to deterministic
algorithms, that is, algorithms in which the actions of each process are uniquely
determined by its local knowledge. Recently, randomized algorithms, in
which  processes  are  permitted  to  toss  fair  coins  to  decide  between  possible 
actions, have emerged as an alternative to deterministic algorithms. A
randomized algorithm can be thought of as a strategy for “playing a game”
against  an  “adversary”,  who  is  usually  assumed  to  have  control  over  the 
inputs  to  the  algorithm  and  the  sequence  in  which  the  processes  take  steps. 
In  choosing  its  own  moves,  the  adversary  may  use  knowledge  of  previous 
moves.  A  randomized  algorithm  should,  with  very  high  probability,  perform 
correctly  against  any  allowable  adversary. 

One of the earliest examples of such a randomized algorithm was
developed by Rabin [Rab82] as a way of circumventing the limitations proved in
[BJL+82]. The shared memory used by Rabin’s algorithm has only $O(\log n)$
values, in contrast to the $\Omega(n)$ lower bound for deterministic algorithms.
Rabin’s algorithm is also simpler than the known deterministic mutual
exclusion algorithms that use $O(n)$-valued shared memory. A disadvantage
is  that  Rabin’s  algorithm  is  not  solving  exactly  the  same  problem — it  is 
not  absolutely  guaranteed  to  grant  the  resource  to  every  requesting  process. 
Rather,  it  does  so  with  probability  that  grows  with  the  amount  of  time  the 
process  waits.  Still,  in  some  situations,  the  advantages  of  simplicity  and 
improved  performance  may  outweigh  the  small  probability  of  failure. 

The  mutual  exclusion  problem  has  also  been  studied  in  message-passing 
models.  The  first  such  solution  was  in  [Lam78],  where  it  was  presented  as  a 
simple application of the use of logical clocks to totally order system events
(see Section 2.3). Mutual exclusion was reduced to the global consistency
problem of getting all processes to have a consistent view of the queue of
waiting processes. More recently, several algorithms have been devised which
attempt to limit the number of messages required to solve the problem. A
generalization to k-exclusion, in which up to k processes can be in their
critical sections at the same time, has also been studied.
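
The role of logical clocks in giving every process the same view of the request queue can be sketched as follows. This is only an illustration of the ordering idea, not the full protocol of [Lam78], which also needs acknowledgment and release messages. The class and method names are assumptions made for this example, and message delivery is modeled by direct method calls.

```python
import heapq


class Process:
    """A process with a logical clock and a local copy of the request queue."""

    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.queue = []                      # heap of (timestamp, pid) requests

    def tick(self):
        """Advance the clock for a local event or a message send."""
        self.clock += 1
        return self.clock

    def receive(self, timestamp):
        """Move the clock past the timestamp carried by an incoming message."""
        self.clock = max(self.clock, timestamp) + 1

    def make_request(self):
        """Create a request stamped (time, pid) and record it locally."""
        stamp = (self.tick(), self.pid)
        heapq.heappush(self.queue, stamp)
        return stamp

    def deliver(self, stamp):
        """Receive another process's request and record it."""
        self.receive(stamp[0])
        heapq.heappush(self.queue, stamp)


# Two processes issue requests concurrently; delivery order differs per process.
a, b, c = Process(0), Process(1), Process(2)
ra = a.make_request()          # (1, 0)
rb = b.make_request()          # (1, 1)
b.deliver(ra)
c.deliver(rb)
c.deliver(ra)
a.deliver(rb)
views = [sorted(p.queue) for p in (a, b, c)]
print(views[0] == views[1] == views[2], views[0])   # True [(1, 0), (1, 1)]
```

Because requests are ordered by timestamp and ties are broken by process identifier, every process that has received both requests ranks them identically, regardless of the order in which the messages arrived.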

The  reader  can  consult  the  book  by  Raynal  [Ray86]  for  more  information 
and  more  pointers  into  the  extensive  literature  on  mutual  exclusion. 

4.1.2  Other  Contention  Problems 

The dining philosophers problem [Dij71] is an important resource
allocation problem in which each process (“philosopher”) requires a specific set of
resources (“forks”). In the traditional statement of the problem, the
philosophers are arranged in a circle, with a fork between each pair of philosophers.
To  eat,  each  philosopher  must  have  both  adjacent  forks.  Dijkstra’s  solution 
is  based  on  variables  (semaphores)  shared  by  all  processes,  and  thus  is  best 
suited  for  use  within  a  single  computer. 

One  way  to  restrict  access  to  the  shared  variables  is  by  associating  each 
variable  with  a  resource,  and  allowing  only  the  processes  that  require  that 
resource  to  access  the  variable.  This  arrangement  suggests  solutions  in  which 
processes  simply  visit  all  their  resources,  attempting  to  acquire  them  one  at 
a  time.  Such  a  solution  permits  deadlock,  where  processes  obtain  some 
resources  and  then  wait  forever  for  resources  held  by  other  processes.  In  the 
circle  of  dining  philosophers,  deadlock  arises  if  each  one  first  obtains  his  left 
fork  and  then  waits  for  his  right  fork. 

The  traditional  dining  philosophers  problem  is  symmetrical  if  processes 
are  identical  and  deterministic  and  all  variables  are  initialized  in  the  same 
way.  If  processes  take  steps  in  round-robin  order,  the  system  configuration 
is  symmetrical  after  every  round.  This  implies  that,  if  any  process  ever 
obtained  all  of  its  needed  resources,  then  every  process  would,  which  is 
impossible.  Hence,  there  can  be  no  such  completely  symmetric  algorithm. 
The  key  to  most  solutions  to  this  problem  is  their  method  for  breaking 
symmetry. 

There  are  several  ways  of  breaking  symmetry.  First,  there  can  be  a  single 
“token”  that  is  held  by  one  process,  or  circulated  around  the  ring.  To  resolve 
a  conflict,  the  process  with  the  token  relinquishes  its  resources  in  exchange 
for  a  guarantee  that  it  can  have  them  when  they  next  become  available. 
Second,  alternate  processes  in  an  even-sized  ring  can  attempt  to  obtain 
their left or right resources first; this strategy can be used not only to avoid
deadlock,  but  also  to  guarantee  a  small  upper  bound  on  waiting  time  for  each 
process.  Third,  Chandy  and  Misra  [CM84]  describe  a  scheme  in  which  each 
resource  has  a  priority  list,  describing  which  processes  have  stronger  claims 
on  the  resource.  These  priorities  are  established  dynamically,  depending 
on  the  demands  for  the  resources.  Although  the  processes  are  identical, 
the  initial  configuration  of  the  algorithm  is  asymmetric:  it  includes  a  set 
of priority lists that cannot induce cycles among waiting processes. The
rules  used  in  [CM84]  to  modify  the  priority  lists  preserve  acyclicity,  and  so 
deadlock  is  avoided. 
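
The deadlock of the symmetric left-first strategy, and its removal by letting alternate philosophers start with different forks, can be reproduced with a small round-robin simulation. This is a sketch under stated assumptions: the fork numbering, the three-stage philosopher model, and the names `simulate`, `left_first` and `alternating` are all invented for this illustration.

```python
def simulate(n, first_fork, rounds=100):
    """Round-robin simulation of n dining philosophers.

    Fork i lies to the left of philosopher i and fork (i + 1) % n to the right.
    first_fork(i) returns "left" or "right". Returns the number of meals per
    philosopher, or None if a whole round passes in which nobody can move.
    """
    holder = [None] * n                 # holder[f]: philosopher holding fork f
    stage = [0] * n                     # 0: no forks, 1: first fork, 2: eating
    meals = [0] * n

    def forks(i):
        left, right = i, (i + 1) % n
        return (left, right) if first_fork(i) == "left" else (right, left)

    for _ in range(rounds):
        moved = False
        for i in range(n):
            first, second = forks(i)
            if stage[i] == 2:                       # finish eating
                holder[first] = holder[second] = None
                stage[i], meals[i] = 0, meals[i] + 1
                moved = True
            else:
                wanted = first if stage[i] == 0 else second
                if holder[wanted] is None:          # otherwise keep waiting
                    holder[wanted] = i
                    stage[i] += 1
                    moved = True
        if not moved:
            return None                             # every philosopher is blocked
    return meals


def left_first(i):
    return "left"


def alternating(i):
    return "left" if i % 2 == 0 else "right"


print(simulate(4, left_first))                       # None: deadlock
print(all(m > 0 for m in simulate(4, alternating)))  # True
```

With four philosophers, the left-first rule deadlocks as soon as everyone holds a left fork. With the alternating rule, a philosopher holding only one fork is always waiting for a fork held by someone who is eating, so some philosopher can always move.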

Finally, Rabin and Lehmann [RL81] describe a simple randomized
algorithm that uses local random choices to break symmetry. Each process
chooses  randomly  whether  to  try  to  obtain  its  left  or  right  fork  first.  In 
either  case,  the  process  waits  until  it  obtains  its  first  fork,  but  only  tests 
once  to  see  if  its  second  fork  is  available.  If  it  is  not,  the  process  relinquishes 
its  first  fork  and  starts  over  with  another  random  choice.  This  strategy 
guarantees  that,  with  probability  1,  the  system  continues  to  make  progress. 
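
The randomized strategy just described can be sketched as a simulation. The round-robin scheduler, the seeded random number generator, the stage names, and the function name `lehmann_rabin` are assumptions made for this illustration. The sketch assumes at least two philosophers and follows the prose description above, not the exact formulation in [RL81].

```python
import random


def lehmann_rabin(n, rounds=1000, seed=0):
    """Simulate the randomized dining philosophers strategy for n >= 2.

    Returns the number of meals eaten by each philosopher.
    """
    rng = random.Random(seed)
    holder = [None] * n                 # holder[f]: philosopher holding fork f
    stage = ["choose"] * n
    first = [None] * n                  # fork each philosopher tries first
    meals = [0] * n
    for _ in range(rounds):
        for i in range(n):
            left, right = i, (i + 1) % n
            if stage[i] == "choose":               # random symmetry breaking
                first[i] = rng.choice((left, right))
                stage[i] = "wait"
            elif stage[i] == "wait":               # wait for the first fork
                if holder[first[i]] is None:
                    holder[first[i]] = i
                    stage[i] = "test"
            elif stage[i] == "test":               # test the second fork once
                second = right if first[i] == left else left
                if holder[second] is None:
                    holder[second] = i
                    stage[i] = "eat"
                else:                              # give up and start over
                    holder[first[i]] = None
                    stage[i] = "choose"
            else:                                  # eating: release both forks
                holder[left] = holder[right] = None
                meals[i] += 1
                stage[i] = "choose"
    return meals
```

A call such as `lehmann_rabin(5)` returns a list of meal counts. Because a philosopher never holds one fork while waiting indefinitely for the other, the hold-and-wait cycle that causes deadlock cannot form.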

These symmetry-breaking techniques avoid deadlock and ensure that the
system makes progress. They provide a variety of fairness and performance
guarantees.

4.1.3  Cooperation  Problems 

For shared-variable models, cooperation problems have received less
attention than contention problems. The only cooperation problems that have
been studied at any length are producer-consumer problems, in which
processes produce results that are used as input by other processes. The
simplest producer-consumer problem is the bounded buffer problem
(Section 2.3). A very general class of producer-consumer problem involves the
simulation  of  a  class  of  Petri  nets  known  as  marked  graphs  [CHEP71],  where 
each  node  in  the  graph  represents  a  process  and  each  token  represents  a 
value.  An  example  of  this  class  is  the  problem  of  passing  a  token  around 
a  ring  of  processes,  where  the  token  can  be  used  to  control  access  to  some 
resource. 

An  interesting  problem  that  combines  aspects  of  both  contention  and 
cooperation  is  concurrent  garbage  collection,  in  which  a  “collector”  process 
running  asynchronously  with  a  “mutator”  process  must  identify  items  in  the 
data  structure  that  are  no  longer  accessible  by  the  mutator  and  add  those 
items to a “free list”. This is basically a producer-consumer problem, with
the collector producing free-list items and the mutator consuming them.
However,  the  problem  also  involves  contention  because  the  mutator  changes 
the  data  structure  while  the  collector  is  examining  it. 

In  shared-variable  models,  cooperation  problems  have  not  been  studied 
as  extensively  as  contention  problems,  probably  because  they  are  easier  to 
solve.  For  example,  in  concurrent  garbage  collection  algorithms,  it  is  the 
contention  for  access  to  the  data  structure  rather  than  the  cooperative  use 
of  the  free  list  that  poses  the  challenge.  However,  there  is  one  important 
property that is harder to achieve in cooperation problems than in
contention problems, namely self-stabilization. An algorithm is said to be
self-stabilizing  if,  when  started  in  any  arbitrary  state,  it  eventually  reaches 
a  state  in  which  it  operates  normally  [Dij74].  For  example,  a  self-stabilizing 
token-passing  algorithm  can  be  started  in  a  state  having  any  number  of 
tokens and will eventually reach a state with just one token that is
being passed around. It is generally easy to devise self-stabilizing algorithms
for contention problems because processes go through a “home” state in which
they are reinitialized (for example, a process in the dining philosophers problem
eventually reaches a state in which it is not holding or requesting any forks), and
the whole algorithm is reinitialized when every process has reached its home
state.  On  the  other  hand,  cooperation  problems  do  not  have  such  a  home 
state.  For  example,  the  symmetry  in  the  bounded  buffer  problem  means 
that  an  empty  buffer  and  a  full  buffer  are  symmetric  situations,  and  neither 
of them can be considered a “home” state. Dijkstra’s self-stabilizing
token-passing algorithms [Dij74] are currently the only published self-stabilizing
cooperation  algorithms. 
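
Self-stabilization can be made concrete with a sketch of the K-state token ring, the best known of the token-passing algorithms of [Dij74]. The code follows the usual textbook statement of that algorithm and is not taken from this survey. The function names, the assumption that K is at least the number of machines, and the use of a random scheduler in place of an arbitrary one are choices made for this illustration.

```python
import random


def privileged(state):
    """Return the indices of the machines that currently hold a token."""
    n = len(state)
    result = [0] if state[0] == state[n - 1] else []
    result += [i for i in range(1, n) if state[i] != state[i - 1]]
    return result


def move(state, i, k):
    """Let privileged machine i make its move."""
    if i == 0:
        state[0] = (state[0] + 1) % k      # machine 0 creates a new value
    else:
        state[i] = state[i - 1]            # the others copy their neighbor


def stabilize(state, k, steps, rng):
    """Perform `steps` moves, each by a randomly chosen privileged machine."""
    for _ in range(steps):
        move(state, rng.choice(privileged(state)), k)
    return state


rng = random.Random(0)
state = [rng.randrange(6) for _ in range(5)]   # arbitrary initial state
stabilize(state, 6, 500, rng)
print(len(privileged(state)))                  # 1
```

At least one machine is always privileged, so the scheduler always has a move to make. Starting from an arbitrary state, which may contain several tokens, the ring reaches a state with exactly one token and keeps exactly one from then on.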

Self-stabilization is an important fault-tolerance property, since it
permits an algorithm to recover from any transient failure. This property has
not  received  the  attention  it  deserves. 

4.1.4  Concurrent  Readers  and  Writers 

With the exception of the bakery algorithm, all of the work we have
described so far assumes that processes access shared memory using primitive
operations  (usually  read  and  write  operations),  each  of  which  is  executed 
indivisibly. The ability to implement multiple processors with a single
integrated circuit has rekindled interest in shared memory models that do not
assume indivisibility of reads and writes. Rather, they assume that
operations on a shared variable have duration, that reads and writes that do not
overlap behave as if they were indivisible, but that reads and writes that
overlap can yield less predictable results [Lam86]. The bakery algorithm
assumes  safe  shared  variables — ones  in  which  a  read  that  is  concurrent  with 
a  write  can  return  an  arbitrary  value  from  the  domain  of  possible  values  for 
the  variable.  Another  possible  assumption  is  a  regular  shared  variable,  in 
which  a  read  that  overlaps  a  write  is  guaranteed  to  return  either  the  old 
value  or  the  one  being  written;  however,  two  successive  reads  that  overlap 
the  same  write  may  obtain  first  the  new  value  then  the  old  one.  A  still 
stronger  assumption  is  an  atomic  shared  variable,  which  behaves  as  if  each 
read  and  each  write  occurred  at  some  fixed  time  within  its  interval. 
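
The difference between the three classes can be illustrated by listing the results they allow for two successive reads that both overlap a single write. The function below is a sketch written for this illustration; its name and parameters are assumptions, and it only restates the definitions given above.

```python
from itertools import product


def two_reads(kind, old, new, domain):
    """Value pairs that two successive reads may return while both overlap
    one write that changes the variable from old to new."""
    if kind == "safe":        # any value of the domain, independently
        return set(product(domain, repeat=2))
    if kind == "regular":     # old or new, independently for each read
        return set(product((old, new), repeat=2))
    if kind == "atomic":      # the write takes effect at one instant
        return {(old, old), (old, new), (new, new)}
    raise ValueError(kind)


domain = range(4)
for kind in ("safe", "regular", "atomic"):
    print(kind, len(two_reads(kind, 0, 3, domain)))
# safe 16
# regular 4
# atomic 3
```

The pair (3, 0), in which the new value is read before the old one, is allowed for a regular variable but not for an atomic one.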

Using  safe,  regular,  or  atomic  shared  variables,  it  is  possible  to  simulate 
shared  variables  having  indivisible  operations,  so  that  algorithms  designed 
for  the  stronger  models  can  be  applied  in  the  weaker  models.  This  work 
has  evolved  from  the  traditional  readers-writers  algorithms  based  on  mutual 
exclusion  [CHP71],  through  nontraditional  algorithms  that  allow  concurrent 
reading  and  writing  [Pet83],  to  more  recent  algorithms  for  implementing  one 
class of shared variable with a weaker class [Lam86, BP87, Blo88].

Recently, Herlihy [Her88] has considered atomic shared variables that
support operations other than reads and writes. He has shown that
read-write atomic variables cannot be used to implement more powerful atomic
shared variables such as those supporting test-and-set operations. He has
also shown that other types of atomic variables are “universal”, in the sense
that they can be used to implement atomic shared variables of arbitrary
types. Herlihy’s impossibility proof proceeds by showing that atomic
read-write shared variables cannot be used to solve a version of the distributed
consensus  problem  discussed  in  the  following  subsection. 

4.2  Distributed  Consensus 

Achieving  global  consistency  requires  that  processes  reach  some  form  of 
agreement.  Problems  of  reaching  agreement  in  a  message-passing  model 
are called distributed consensus problems. There are many such
problems, including agreeing (exactly or approximately) on values from some
domain, synchronizing actions of different processes, and synchronizing
software clocks. Distributed consensus problems arise in areas as diverse as
real-time  process-control  systems  (where  agreement  might  be  needed  on  the 
values  read  by  replicated  sensors)  and  distributed  database  systems  (where 
agreement  might  be  needed  on  whether  or  not  to  accept  the  results  of  a 
transaction). Since global consistency is what makes a collection of
processes into a single system, distributed consensus algorithms are ubiquitous
in distributed systems.

Consensus  problems  are  generally  easy  to  solve  if  there  are  no  failures; 
in  this  case,  processes  can  exchange  information  reliably  about  their  local 
states,  and  thereby  achieve  a  common  view  of  the  global  state  of  the  system. 
The  problem  is  considerably  harder,  however,  when  failures  are  considered. 
Consensus  algorithms  have  been  presented  for  almost  all  the  classes  of  failure 
described  in  Section  2.1.1. 

Distributed  consensus  problems  have  been  a  popular  subject  for  theoret¬ 
ical  research  recently,  because  they  have  simple  mathematical  formulations 
and are surprisingly challenging. They also provide a convenient vehicle for
comparing  the  power  of  models  that  make  different  assumptions  about  time 
and  failures. 

4.2.1  The  Two-Generals  Problem 

Probably  the  first  distributed  consensus  problem  to  appear  in  the  literature 
is the “two-generals problem” [Gra78], in which two processes must reach
agreement  when  there  is  a  possibility  of  lost  messages.  The  problem  is 
phrased  as  that  of  two  generals,  who  communicate  by  message,  having  to 
agree  upon  whether  or  not  to  attack  a  target.  The  following  argument 
can  be  formalized  to  show  that  the  problem  is  unsolvable  when  messages 
may  be  lost.  Reaching  at  least  one  of  the  two  possible  decisions,  say  the 
decision  to  attack,  requires  the  successful  arrival  of  at  least  one  message. 
Consider a scenario S in which the fewest delivered messages that will result
in agreement to attack are delivered, and let S' be the same scenario as S
except  that  the  last  message  delivered  in  scenario  S  is  lost  in  S',  and  any 
other  messages  that  might  later  be  sent  are  also  lost.  Suppose  this  last 
message  is  from  general  A  to  general  B.  General  A  sees  the  same  messages 
in  the  two  scenarios,  so  he  must  decide  to  attack.  However,  the  minimality 
assumption  of  S  implies  that  B  cannot  also  decide  to  attack  in  scenario  S', 
so  he  must  make  a  different  decision.  Hence,  the  problem  is  unsolvable. 

4.2.2  Agreement  on  a  Value 

The  agreement  problem  requires  that  processes  agree  upon  a  value.  Com¬ 
munication  is  assumed  to  be  reliable,  but  processes  are  subject  to  failures 
(either  halting,  omission,  or  Byzantine).  Each  of  the  processes  begins  the 
algorithm  with  an  input  value.  After  the  algorithm  has  completed,  each 
process is to decide upon an output value. There are two constraints on the
solution: (a) (Agreement) all nonfaulty processes must agree on the output,
and (b) (Validity) if all nonfaulty processes begin with the same input
value,  that  value  must  be  the  output  value  of  all  nonfaulty  processes.  For 
the  case  of  Byzantine  faults,  this  problem  has  been  called  the  Byzantine 
generals  problem.  (Other,  equivalent  formulations  of  the  problem  have  also 
been  used.) 

In  the  absence  of  failures,  this  problem  is  easy  to  solve:  processes  could 
simply  exchange  their  values,  and  each  could  decide  upon  the  majority  value. 
The following example shows the kinds of difficulties that can occur when
failures are considered. Consider three processes, A, B, and C. Suppose
that  A  and  B  begin  with  input  0  and  1  respectively.  Suppose  that  C  is  a 
Byzantine  faulty  processor,  which  acts  toward  A  as  if  C  were  nonfaulty  and 
started  with  0,  but  as  if  B  were  faulty.  At  the  same  time,  C  acts  toward 
B  as  if  C  were  nonfaulty  and  started  with  1,  but  as  if  A  were  faulty.  Since 
A’s  view  of  the  execution  is  consistent  with  A  and  C  being  nonfaulty  and 
starting  with  the  same  input,  0,  A  is  required  to  decide  0.  Analogously,  B 
is  required  to  decide  1.  But  this  means  that  A  and  B  have  been  made  to 
disagree, violating the agreement requirement of the problem. This example
can be elaborated into a proof of the impossibility of reaching agreement
among 3t processes if t processes might be faulty.
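
The three-process scenario can be made concrete with a small simulation.
The sketch below is illustrative only. It assumes a hypothetical one-round
protocol in which each process decides the majority of the values it hears;
the names majority, run_round, and lies are ours and do not come from the
papers cited. The impossibility argument itself applies to any protocol,
not just this one.

```python
from collections import Counter

def majority(values):
    """Return the most common value; ties are broken toward the smaller value."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

def run_round(inputs, faulty, lies):
    """One exchange round of a hypothetical majority-vote protocol.

    inputs: dict process -> input value
    faulty: name of the Byzantine process
    lies:   dict receiver -> value the faulty process reports to that receiver
    Returns dict nonfaulty process -> decision.
    """
    decisions = {}
    for p in inputs:
        if p == faulty:
            continue
        heard = [inputs[p]]                      # a process counts its own input
        for q in inputs:
            if q == p:
                continue
            heard.append(lies[p] if q == faulty else inputs[q])
        decisions[p] = majority(heard)
    return decisions

# C tells A that it started with 0 and tells B that it started with 1.
inputs = {"A": 0, "B": 1, "C": None}
decisions = run_round(inputs, faulty="C", lies={"A": 0, "B": 1})
print(decisions)
```

Here A hears the values 0, 1, 0 and B hears 1, 0, 1, so the two nonfaulty
processes decide differently even though each one's view is consistent with
C being nonfaulty.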

The  problem  of  reaching  agreement  on  a  value  was  studied  by  Pease, 
Shostak,  and  Lamport  [PSL80,  LSP82]  in  a  model  with  Byzantine  failures 
and  computation  performed  in  a  sequence  of  rounds.  (They  also  described 
the  implementation  of  rounds  with  synchronized  clocks.)  Besides  containing 
the  impossibility  proof  described  in  the  last  paragraph,  these  papers  also 
contain  two  subtle  algorithms.  The  first  is  a  recursive  algorithm  that  requires 
3t+1 processes and tolerates Byzantine faults. The second requires only t+1
processes, but assumes digital signatures (Section 2.1.^ ). Both algorithms
assume  a  completely  connected  network. 

Dolev [Dol82] considers the same problem in an arbitrary network graph.
For  t  Byzantine  failures,  he  shows  how  to  implement  an  algorithm  similar 
to that of [LSP82], provided that the network is at least (2t+1)-connected
(and has at least 3t+1 processes). He also proves a matching lower bound.

A series of results, starting with [FL81] and culminating in [DM86],
shows  that  any  synchronous  algorithm  for  reaching  agreement  on  a  value, 
in  the  presence  of  t  failures — even  the  simple  halting  failures — requires  at 
least t+1 rounds of message exchange in the worst case. As usual, these
arguments are based on the limitations caused by local knowledge in
distributed algorithms; by assuming fewer rounds, a “chain” of computations
is constructed that leads to a contradiction. In the first computation in
the  chain,  nonfaulty  processes  are  constrained  by  the  problem  statement  to 
decide  0,  while  in  the  last  computation  in  the  chain,  nonfaulty  processes 
are  constrained  to  decide  1.  Further,  any  two  consecutive  computations  in 
the  chain  share  a  nonfaulty  process  to  which  the  two  computations  look 
the  same;  this  process  therefore  reaches  the  same  decision  in  both  compu¬ 
tations.  Hence,  all  nonfaulty  processes  decide  upon  the  same  value  in  every 
computation  in  the  chain,  which  yields  the  required  contradiction. 
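
To see where the t+1 rounds go, it can help to look at a simple algorithm
that uses exactly that many. The sketch below is not taken from the papers
cited; it is a standard flooding construction for halting failures, included
as an assumption on our part. Every process repeatedly broadcasts the set
of values it has seen, and after t+1 rounds decides the single value it
knows, or a default if it knows more than one. The function name floodset
and the crashes, default, and rounds parameters are hypothetical.

```python
def floodset(inputs, t, crashes=None, default=0, rounds=None):
    """Simulate synchronous value flooding under halting failures.

    inputs:  dict process -> input value
    t:       maximum number of halting failures tolerated
    crashes: dict process -> (round, recipients); the process halts in that
             round (1-based) after sending only to `recipients`
    rounds:  number of rounds to run (defaults to t + 1)
    Returns dict surviving process -> decision.
    """
    crashes = crashes or {}
    rounds = t + 1 if rounds is None else rounds
    known = {p: {v} for p, v in inputs.items()}
    alive = set(inputs)
    for rnd in range(1, rounds + 1):
        outgoing = {}
        for p in alive:
            if p in crashes and crashes[p][0] == rnd:
                targets = set(crashes[p][1])      # partial send, then halt
            else:
                targets = set(inputs) - {p}
            outgoing[p] = (frozenset(known[p]), targets)
        alive -= {p for p in crashes if crashes[p][0] == rnd}
        for vals, targets in outgoing.values():
            for q in targets & alive:
                known[q] |= vals
    return {p: (next(iter(known[p])) if len(known[p]) == 1 else default)
            for p in alive}

inputs = {"A": 0, "B": 1, "C": 1}
crash = {"A": (1, {"B"})}             # A reaches only B, then halts
after_two = floodset(inputs, t=1, crashes=crash)             # t + 1 = 2 rounds
after_one = floodset(inputs, t=1, crashes=crash, rounds=1)   # one round too few
print(after_two, after_one)
```

With a single round, B has seen both values and C only one, so they decide
differently; the second round lets B pass on what it learned. This shows
only that this particular algorithm needs its last round, not the general
lower bound.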

Dwork  and  Moses  [DM86]  provide  explicit,  intuitive  definitions  for  the 
“knowledge”  that  individual  processes  have  at  any  time  during  the  execution 
of  an  algorithm.  Their  problem  statements,  algorithms,  and  lower  bound 
proofs  are  based  on  these  definitions.  This  work  suggests  that  formal  models 
and  logics  of  knowledge  may  provide  useful  high-level  ways  of  reasoning 
about  distributed  algorithms. 

Bracha [Bra85] is able to circumvent the t+1 lower bound on rounds
with a randomized algorithm; his solution uses only O(log n) rounds, but
requires  cryptographic  techniques  that  rest  on  special  assumptions.  More 
recently,  Feldman  and  Micali  [FM88]  have  improved  Bracha’s  upper  bound 
to  a  constant.  Chor  and  Coan  [CC84]  give  another  randomized  algorithm 
that requires O(t/log n) rounds, but does not require any special
assumptions.

The  consensus  algorithms  mentioned  above  all  assume  a  synchronous 
model  of  computation.  Fischer,  Lynch,  and  Paterson  [FLP85]  study  the 
problem of reaching agreement on a value in a completely asynchronous
model. They obtain a surprising fundamental impossibility result: if there
is the possibility of even one simple halting failure, then an asynchronous
system  of  deterministic  processes  cannot  guarantee  agreement.  This  result 
suggests  that,  while  asynchronous  models  are  simple,  general,  and  popular, 
they  are  too  weak  for  studying  fault  tolerance. 

The  impossibility  result  is  proved  by  first  showing  that  any  asynchronous 
consensus  protocol  that  works  correctly  in  the  absence  of  faults  must  have  a 
reachable  configuration  C  in  which  there  is  a  single  “decider”  process  i — one 
that  is  capable,  on  its  own,  of  causing  either  of  two  different  decisions  to  be 
reached.  If  this  protocol  is  also  required  to  tolerate  a  single  process  failure, 
then, starting from C, all the processes except i must be able to reach a
decision.  But,  this  decision  will  conflict  with  one  of  the  possible  decisions 
process  i  might  reach  on  its  own.  (Herlihy  used  a  similar  technique  to  prove 
the  impossibility  result  mentioned  in  the  previous  subsection.) 

There are several ways to cope with the limitation described in [FLP85].
One can simply add some synchrony assumptions; the weakest ones commonly
used are timers and bounded message delay. Alternatively, one can
use  an  asynchronous  deterministic  algorithm,  but  attempt  to  reduce  the 
probability  that  a  failure  will  upset  correct  behavior.  This  approach  is  some¬ 
times  used  in  practice  when  only  modest  reliability  is  needed,  but  there  has 
been  no  rigorous  attempt  to  analyze  the  reliability  of  the  resulting  system. 

Another  possibility  is  to  use  randomized  rather  than  deterministic  al¬ 
gorithms.  For  example,  Ben-Or  [Ben83]  gives  a  randomized  algorithm  for 
reaching  agreement  on  a  value  in  the  completely  asynchronous  model,  allow¬ 
ing  Byzantine  faults.  The  algorithm  never  permits  disagreement  or  violates 
the validity condition; however, instead of guaranteeing eventual
termination, it guarantees only termination with probability 1.

A  good  survey  of  the  early  work  in  this  area  appears  in  [Fis83]. 

4.2.3  Other  Consensus  Problems 

Other  distributed  consensus  problems  have  been  studied  under  the  assump¬ 
tion  that  processes  can  be  faulty  but  communication  is  reliable.  One  such 
problem is that of reaching approximate, rather than exact, agreement on
a value. Each process begins with an initial (infinite-precision) real value,
and must eventually decide on a real value subject to: (a) (Agreement) all
nonfaulty processes’ decisions must agree to within ε, and (b) (Validity) the
decision value for any nonfaulty process must be within ε of the range of the
initial  values  of  the  nonfaulty  processes.  Processes  are  permitted  to  send 
real  values  in  messages. 

Although  the  problems  of  exact  and  approximate  agreement  seem  to 
be  quite  similar,  reaching  approximate  agreement  is  considerably  easier;  in 
particular, there are simple deterministic algorithms for approximate
agreement in asynchronous models, even in the presence of Byzantine faults. It
seems almost paradoxical that deterministic processes can reach agreement
on real values to within any predetermined ε, but they cannot reach exact
agreement  on  a  single  bit. 
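
The flavor of such algorithms can be sketched as follows. This is a
simplified illustration under assumptions of our own: it uses synchronous
rounds rather than the asynchronous model discussed above, it assumes at
least 3t+1 processes, and the names approx_round, spread, and
byzantine_reports are hypothetical. In each round a process sorts the
values it receives, discards the t lowest and t highest, and moves to the
midpoint of what remains.

```python
def approx_round(values, t, byzantine_reports):
    """One synchronous round of approximate agreement (illustrative sketch).

    values:            dict nonfaulty process -> current real value
    t:                 number of Byzantine processes tolerated
    byzantine_reports: list (one entry per faulty process) of dicts mapping
                       each nonfaulty receiver to the value reported to it
    Returns dict nonfaulty process -> new value.
    """
    new_values = {}
    for p in values:
        received = list(values.values())                 # reliable delivery
        received += [rep[p] for rep in byzantine_reports]
        received.sort()
        trimmed = received[t:len(received) - t]          # drop t lowest, t highest
        new_values[p] = (trimmed[0] + trimmed[-1]) / 2   # midpoint of the rest
    return new_values

def spread(values):
    """Largest difference between any two current values."""
    return max(values.values()) - min(values.values())

values = {"A": 0.0, "B": 4.0, "C": 10.0}
lo, hi = min(values.values()), max(values.values())
lies = [{"A": -100.0, "B": 5.0, "C": 100.0}]   # one two-faced process
eps = 0.1
rounds = 0
# The loop test uses global knowledge for the demonstration only; a real
# process would instead run a number of rounds computed in advance.
while spread(values) > eps:
    values = approx_round(values, t=1, byzantine_reports=lies)
    rounds += 1
print(rounds, values)
```

Trimming keeps every retained value inside the range of the nonfaulty
values, and with at least 3t+1 processes the midpoint rule at least halves
the spread in each round, so any ε is reached after finitely many rounds.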

Another consensus problem is achieving simultaneous action by distributed
processes, in a model with timers in which all messages take exactly
the same (known) time for delivery. This problem, sometimes called
the  “distributed  firing  squad  problem”,  yields  results  very  similar  to  those 
for  agreement  on  a  value.  In  fact,  for  the  case  of  Byzantine  faults,  a  general 
transformation converts any algorithm for agreement to an algorithm for
simultaneous action. The firing squad algorithm is obtained by running many
instances of the agreement algorithm, each deciding whether the processes
should fire at a particular time. The first instance that reaches a positive
decision triggers the simultaneous firing action.

In  this  transformation,  many  instances  of  a  Byzantine  agreement  algo¬ 
rithm  are  executed  concurrently.  Those  instances  that  are  not  actually  car¬ 
rying  out  any  interesting  computation  can  be  implemented  in  a  trivial  way 
by  letting  all  of  their  messages  be  special  “null”  messages  that  are  not  actu¬ 
ally  sent.  This  trick  of  sending  a  message  by  not  sending  a  message  is  also 
used  in  [Lam84]  to  give  fault-tolerant  distributed  simulations  of  centralized 
algorithms. 

Another  consensus  problem  is  establishing  and  maintaining  synchronized 
local  clocks  in  a  distributed  system.  It  is  closely  related  to  both  of  the  pre¬ 
ceding  problems  (reaching  approximate  agreement  and  achieving  simultane¬ 
ous  action),  since  it  may  be  viewed  as  simultaneously  reaching  approximate 
agreement on a clock value, or as reaching exact agreement on a clock value
at approximately the same instant. The problem is one of implementing
synchronized clocks using timers that run at approximately the same rate,
usually assuming initial synchronization of the clocks. However, it is
generally described in terms of maintaining the synchronization (to within ε)
of the processes’ clocks despite a small, varying difference in their clock
rates.

Clock synchronization is difficult to achieve in the presence of faulty
processes. Many algorithms to solve this problem have been suggested,
analyzed, and compared in the literature, and some have been used in
implementing systems. In most algorithms for maintaining synchronization
among clocks that are initially synchronized, a new round is begun when
the clocks reach predetermined values. In each round, processes exchange
information about their clock values and use the information to adjust their
own clocks. Synchronization algorithms that do not assume the clocks to
be initially synchronized use other methods, since they cannot depend upon
the clocks to determine when the first round should begin.

Lower  bounds  and  impossibility  results  have  also  been  proved  for  clock 
synchronization  problems.  Of  particular  interest  is  the  result  of  Dolev, 
Halpern,  and  Strong  [DHS84]  showing  that  clock  synchronization  problems 
cannot  be  solved  for  3t  processes  if  t  of  them  can  exhibit  Byzantine  failures. 

This  impossibility  result  is  reminiscent  of  the  impossibility  result  de¬ 
scribed  earlier  for  agreement  on  a  value,  where  the  problem  cannot  be  solved 
with  3t  processes  in  the  presence  of  t  Byzantine  failures.  In  fact,  a  3t  versus 
t  impossibility  result  also  holds  for  many  other  consensus  problems  under 
Byzantine failures, including approximate agreement and simultaneous
action. Furthermore, all of these problems are unsolvable in network graphs
having less than (2t+1)-connectivity. These impossibility results do not apply
if authentication is used.

Figure 4: The systems T and H

Since all of these bounds are tight, it is apparent that there must be a
common reason for the many similar results. Fischer, Lynch, and Merritt
[FLM86] tie together this large collection of impossibility results with a
common proof technique. We illustrate this technique by proving the
3-versus-1 impossibility result for reaching agreement on a value. Assume
for the sake of obtaining a contradiction that there is such a solution for
the system T consisting of the three processes A, B, and C arranged in a
triangle. Let H be a new system, consisting of two copies of each of A, B
and C, in the hexagonal arrangement shown in Figure 4. Note that system
H looks locally like the original system T.

Let L be a computation of H that results if H is run with each of its
six processes behaving exactly like the corresponding nonfaulty process of
T. Consider any pair of nonfaulty processes in H, say the upper-right-hand
copies of A and B. There is a computation L' of T, with C faulty, in which
A and B receive the same inputs as they do in the computation L. By
our assumption, A and B agree on the same value in L'. Since the copies
of A and B have the same view in L as their namesakes do in L', they
must  also  agree  on  the  same  value.  Moreover,  if  A  and  B  have  the  same 
input  value,  then  that  is  their  output  value.  Since  this  proof  works  for  any 
pair  of  adjacent  processes  in  H,  this  shows  that  in  any  computation  of  H: 
(a) all processes must agree upon the same output value, and (b) if adjacent
processes have the same input value then that is their output value. Letting
the  upper-right-hand  copies  of  A  and  B  have  input  values  of  1  and  the 
lower-left-hand  copies  of  A  and  B  have  input  values  of  0,  this  implies  that 
the  output  of  every  process  must  equal  both  0  and  1,  which  is  the  required 
contradiction.  Other  impossibility  results  are  proved  similarly,  using  slightly 
more  complicated  systems  for  H. 

4.2.4  The  Distributed  Commit  Problem 

The  transaction  commit  problem  for  distributed  databases  is  the  problem 
of reaching agreement, among the nodes that have participated in a
transaction, about whether to commit or abort the transaction. (We say more
about  transactions  in  Section  4.4.)  The  requirements  are:  (a)  (Agreement) 
all  nonfaulty  processes’  decisions  must  agree,  and  (b)  (Validity)  (i)  if  any 
process’s  initial  value  is  “abort”,  then  the  decision  must  be  “abort”,  and 
(ii)  if  all  processes’  initial  values  are  “commit”  and  no  failure  occurs,  then 
the  decision  must  be  “commit”.  The  problem  has  traditionally  been  studied 
under  the  assumption  of  halting  failures  and  the  loss  of  individual  messages. 
The impossibility result of [FLP85] implies that the commit problem cannot
be  solved  in  the  completely  asynchronous  model,  for  even  a  single  faulty 
process — even  with  reliable  communication.  The  impossibility  result  for  the 
two-generals  problem  implies  that  the  commit  problem  cannot  be  solved  if 
messages  can  be  lost,  even  if  message  delays  are  otherwise  bounded  and 
processes  are  reliable  and  have  synchronized  clocks. 

Most  commit  protocols,  such  as  the  popular  two-phase  commit  algo¬ 
rithm,  have  a  failure  window — a  period  during  the  computation  when  a 
single  halting  failure  can  prevent  termination.  Using  the  assumptions  that 
the  processes  have  synchronized  clocks  and  there  is  a  known  upper  bound 
on  message  delivery  time,  one  can  construct  a  commit  protocol  that  has 
no  failure  window  from  a  synchronous  algorithm  for  reaching  agreement  on 
a  value.  However,  the  synchronous  model  does  not  permit  communication 
failure,  so  the  loss  of  a  message  must  be  considered  to  be  a  failure  of  either 
the  sending  or  receiving  process.  The  three-phase  commit  protocol  of  Skeen 
[Ske82]  is  another  commit  protocol  without  a  failure  window;  it  assumes 
reliable  message  delivery  and  detectable  failures. 
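
The failure window of two-phase commit can be illustrated with a small
sketch. The text does not spell the protocol out, so the following is an
assumption based on its usual textbook shape: a coordinator collects votes,
then announces "commit" only if every vote was "commit". The function name
two_phase_commit and its coordinator_crashes_after_votes flag are
hypothetical.

```python
def two_phase_commit(votes, coordinator_crashes_after_votes=False):
    """Outcome of one run of a textbook-style two-phase commit.

    votes: dict participant -> "commit" or "abort" (the phase 1 replies)
    Returns dict participant -> decision, where None means the participant
    is blocked, waiting for a coordinator that has halted.
    """
    if coordinator_crashes_after_votes:
        # A participant that voted "abort" can abort on its own; one that
        # voted "commit" cannot decide safely without the coordinator.
        return {p: ("abort" if v == "abort" else None)
                for p, v in votes.items()}
    decision = "commit" if all(v == "commit" for v in votes.values()) else "abort"
    return {p: decision for p in votes}

normal = two_phase_commit({"X": "commit", "Y": "commit"})
vetoed = two_phase_commit({"X": "commit", "Y": "abort"})
blocked = two_phase_commit({"X": "commit", "Y": "commit"},
                           coordinator_crashes_after_votes=True)
print(normal, vetoed, blocked)
```

The first two runs satisfy the validity conditions stated above. The third
shows the window: a single halting failure of the coordinator leaves every
participant that voted "commit" unable to terminate.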


4.3  Network  Algorithms 

We  now  describe  a  class  of  algorithms  for  message-passing  models,  which 
we  call  network  algorithms,  in  which  the  behavior  of  the  algorithm  depends 
strongly on the network topology. Most of these algorithms are designed to
solve  problems  arising  in  communication  in  computer  networks.  They  usu¬ 
ally  assume  a  completely  asynchronous,  failure-free  model.  Most  of  them 
can  be  divided  into  two  categories,  which  we  call  static  and  dynamic.  Static 
algorithms  are  assumed  to  operate  in  fixed  networks  and  to  start  with  all 
their  inputs  available  at  the  beginning;  dynamic  algorithms  also  operate  in 
fixed  networks  but  receive  some  of  their  inputs  interactively.  Another  way  of 
viewing  the  distinction  is  that  static  algorithms  are  based  upon  unchanging 
information  in  the  initial  states  of  the  processes,  while  dynamic  algorithms 
use  changing  information  from  the  changing  state  of  the  application  pro¬ 
cesses.  A  network  problem  can  have  both  static  and  dynamic  versions,  but 
the  two  versions  are  usually  treated  separately  in  the  literature.  We  also 
consider  some  algorithms  designed  to  operate  in  changing  networks,  and 
some  algorithms  designed  to  ensure  reliable  message  delivery  over  a  single 
unreliable  link. 

4.3.1  Static  Algorithms 

Route-Determination  Algorithms  In  communication  networks,  it  is 
often  important  for  processes  that  have  local  information  about  the  speed, 
bandwidth,  and  other  costs  of  message  transmission  to  their  immediate  net¬ 
work  neighbors,  to  determine  good  routes  through  the  network  for  commu¬ 
nicating  with  distant  processes.  If  such  routes  are  to  be  determined  infre¬ 
quently,  it  may  be  useful  to  consider  the  static  problem  in  which  the  local 
information  is  assumed  to  be  available  initially  and  fixed  during  execution 
of  the  route-determination  algorithm. 

Different  applications  require  different  notions  of  what  constitutes  a 
“good”  set  of  routes  through  the  network.  For  example,  if  the  routes  are 
used  primarily  for  broadcasting  a  single  message  to  all  other  processes,  un¬ 
necessary  message  duplication  can  be  avoided  by  establishing  a  spanning 
tree  of  the  network.  If  a  weight  is  associated  with  each  link  in  the  network 
to  represent  the  cost  of  sending  a  message  over  that  link,  a  minimum-weight 
spanning tree (MST) can be used to minimize the total cost of the broadcast.

Gallager, Humblet, and Spira [GHS83] present an efficient distributed
algorithm for finding a minimum-weight spanning tree in a network with n
nodes and e edges. The algorithm is based upon the following two
observations: if all edge weights are distinct, then the MST is unique; and the
minimum-weight  external  edge  of  any  subtree  of  the  MST  is  in  the  MST. 
The  algorithm  grows  the  MST  by  coalescing  fragments  until  the  complete 
MST  is  formed.  Initially,  each  node  is  a  fragment,  and  a  fragment  coalesces 
with  the  one  at  the  other  end  of  its  minimum-weight  external  edge. 
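
The fragment-coalescing idea can be sketched sequentially. The code below
is not the distributed algorithm of [GHS83]: it has no messages and no
priority scheme, and it runs on a single machine with the whole graph in
hand. It only illustrates the two observations above, under the assumption
that edge weights are distinct. The name mst_by_fragments is ours.

```python
def mst_by_fragments(n, edges):
    """Grow an MST by repeatedly coalescing fragments.

    n:     number of nodes, named 0 .. n-1
    edges: list of (weight, u, v) with distinct weights
    Returns the set of MST edges of a connected graph.
    """
    fragment = list(range(n))            # fragment[v] leads to v's fragment id

    def find(v):
        while fragment[v] != v:
            fragment[v] = fragment[fragment[v]]   # path halving
            v = fragment[v]
        return v

    tree = set()
    while len(tree) < n - 1:
        # Each fragment determines its minimum-weight external edge.
        best = {}
        for w, u, v in edges:
            fu, fv = find(u), find(v)
            if fu == fv:
                continue                           # internal edge
            for f in (fu, fv):
                if f not in best or w < best[f][0]:
                    best[f] = (w, u, v)
        if not best:
            raise ValueError("graph is not connected")
        # Coalesce along the chosen edges; all of them belong to the MST.
        for w, u, v in best.values():
            fu, fv = find(u), find(v)
            if fu != fv:
                fragment[fu] = fv
                tree.add((w, u, v))
    return tree

edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]
tree = mst_by_fragments(4, edges)
print(sorted(tree))
```

Distinct weights guarantee that the edges chosen in one pass never form a
cycle, which is why the pass can add all of them.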

The main achievement of this algorithm is to keep the number of
messages small. Each time a fragment with f nodes computes its
minimum-weight external edge, O(f) messages are required. Naively coalescing
fragments could lead to as many as Ω(n²) messages. By using a priority scheme
to determine when fragments are permitted to coalesce, this algorithm
generates only O(n log n + e) messages.

Although  the  basic  idea  is  simple,  the  algorithm  itself  is  quite  compli¬ 
cated.  Certain  simple  “high-level”  tasks,  such  as  determining  a  fragment’s 
minimum-weight  external  edge,  are  implemented  as  a  series  of  separate  steps 
occurring at different processes. The steps implementing different
high-level tasks interleave in complicated ways. The correctness of the
algorithm is not obvious; in fact, only recently have careful correctness
proofs appeared [GC88, WLL88]. While these proofs use techniques based upon
those described in Section 3, they are lengthy and difficult to check. In
general, network algorithms are typically longer and harder to understand than
the  other  types  of  distributed  algorithms  we  are  considering,  and  rigorous 
correctness  proofs  are  seldom  given. 

Many  other  network  algorithms  are  also  designed  to  minimize  the  num¬ 
ber  of  messages  sent.  While  message  complexity  is  easy  to  define  and 
amenable  to  clean  upper  and  lower  bound  results,  time  bounds  may  be 
more  important  in  practice.  However,  there  have  so  far  been  few  upper  and 
lower  time  bounds  derived  for  network  problems. 

Other  route-determination  algorithms  have  been  proposed  for  finding 
MST’s in a directed graph of processes [Hum83] and for determining other
routing structures, such as the set of shortest paths between all pairs of nodes
and breadth-first and depth-first spanning trees. Also, a basic lower bound
of Ω(e) has been proved for the number of messages required to implement
broadcast  in  an  arbitrary  network  [AGPV88]. 

Leader  Election  In  this  problem,  a  network  of  identical  processes  must 
choose  a  “leader”  from  among  themselves.  The  processes  are  assumed  to 
be indistinguishable, except that they may possess unique identifiers. The
difficulty lies in breaking the symmetry. Solutions can be used to implement
a fault-tolerant token-passing algorithm; if the token is lost, the
leader-election algorithm is invoked to decide which process should possess
the token.

Peterson  [Pet82]  has  devised  a  leader-election  algorithm  for  a  completely 
asynchronous  ring  of  processes  with  unidirectional  communication;  it  uses 
at most O(n log n) messages in the worst case. On the other hand,
Frederickson and Lynch [FL87] have shown that at least Ω(n log n) messages are
required in the worst case, even in a ring having synchronous and
bidirectional communication.
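
The O(n log n) behavior can be seen in a phase-by-phase simulation. The
sketch below follows the way unidirectional election algorithms in the
style of [Pet82] are commonly presented, as we recall it; the details are
our assumption rather than a transcription of that paper, and the
simulation proceeds in lockstep phases instead of modeling asynchronous
message delivery. In each phase an active process learns the values held
by its two nearest active predecessors, and stays active (adopting its
predecessor's value) only if that value is larger than both of the others.
The name peterson_election is hypothetical.

```python
import math

def peterson_election(ids):
    """Simulate a Peterson-style election on a unidirectional ring.

    ids: nonempty list of distinct identifiers in ring order
         (messages flow from index i to index i + 1)
    Returns (elected identifier, total messages sent).
    """
    n = len(ids)
    active = list(ids)            # values currently held by active processes
    messages = 0
    while len(active) > 1:
        survivors = []
        for i in range(len(active)):
            own = active[i]
            pred = active[i - 1]      # value of the nearest active predecessor
            pred2 = active[i - 2]     # value of the active process before that
            # Stay active, adopting pred, only if pred is a local maximum.
            if pred > own and pred > pred2:
                survivors.append(pred)
        active = survivors
        messages += 2 * n             # two values each travel once around the ring
    messages += n                     # the last active process sees its own value return
    return active[0], messages

ring = [3, 7, 1, 8, 5, 2, 6, 4]
leader, messages = peterson_election(ring)
bound = 2 * len(ring) * math.floor(math.log2(len(ring))) + len(ring)
print(leader, messages, bound)
```

At most half of the active processes survive a phase, because the
survivors correspond to local maxima around the ring, so there are at most
log n phases of 2n messages each.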

These  results  would  characterize  the  message  complexity  in  the  impor¬ 
tant  special  case  of  a  ring  of  processes  but  for  an  interesting  technicality.  The 
Frederickson-Lynch  lower  bound  assumes  that  the  algorithm  uses  process 
identifiers  only  in  order  comparisons,  but  not  in  counting  or  more  general 
arithmetic  operations.  Almost  all  published  election  algorithms  satisfy  this 
assumption. The lower bound also holds for more general uses of
identifiers if, for each ring size, the algorithm satisfies a uniform time bound,
independent of the process identifiers. Without this technical assumption,
the problem can be solved with only O(n) messages by an algorithm taking
an  unbounded  amount  of  time  [FL87,  Vit84].  Although  unlikely  to  be  of 
practical  use,  this  algorithm  provides  an  interesting  extreme  time-message 
tradeoff. 

The  election  problem  has  been  solved  under  many  different  assumptions: 
the  network  can  be  a  ring,  a  complete  graph,  or  a  general  graph;  the  graph 
can  be  directed  or  undirected;  the  processes  might  have  unique  identifiers 
or  be  identical;  the  individual  processes  might  or  might  not  know  the  size 
or  shape  of  the  network;  the  algorithm  can  be  deterministic  or  randomized; 
communication  can  be  synchronous,  asynchronous,  or  partially  synchronous; 
and  failures  might  or  might  not  be  allowed.  The  problem  has  provided  an 
opportunity  to  study  a  single  problem  under  many  different  assumptions, 
but  no  general  principles  have  yet  emerged. 

Other  Problems  Other  static  problems  include  the  computation  of  func¬ 
tions,  such  as  the  median  and  other  order  statistics,  where  the  inputs  are  ini¬ 
tially  distributed.  Attiya  et  al.  [ASW88]  and  Abrahamson  et  al.  [AAHK86] 
have obtained especially interesting upper and lower bound results (many
surprisingly tight) about the number of messages required to compute
functions in a ring.

4.3.2 Dynamic Algorithms

Distributed  Termination  In  this  problem,  each  process  is  either  active 
or  inactive.  Only  an  active  process  can  send  a  message,  and  an  inactive 
process can become active only by receiving a message. A termination
detection algorithm detects when no further process activity is possible:
that is, when all processes are simultaneously inactive and no messages are in
transit.

This  problem  was  first  solved  by  Dijkstra  [DS80]  for  the  special  case 
in  which  the  application  program  is  a  “diffusing  computation” — one  where 
all  activity  originates  from  and  returns  to  one  controlling  process.  Other 
researchers  have  addressed  the  problem  of  detecting  termination  in  CSP 
programs;  because  CSP  programs  admit  the  possibility  of  deadlock  as  well 
as  normal  termination,  these  algorithms  must  also  recognize  deadlock.  Ter¬ 
mination  can  also  be  detected  using  global  snapshot  algorithms,  discussed 
later  in  this  section. 

Distributed  Deadlock  Detection  Here,  it  is  assumed  that  processes  re¬ 
quest  resources  and  release  them,  and  there  is  some  mechanism  for  granting 
resources  to  requesting  processes.  However,  resources  may  be  granted  in 
such  a  way  that  deadlock  results — for  example,  in  the  dining  philosophers 
problem,  each  philosopher  may  have  requested  both  forks  and  received  only 
his  right  fork,  so  the  system  is  deadlocked  because  no  one  can  obtain  his 
left  fork.  A  deadlock  detection  algorithm  detects  such  a  situation,  so  ap¬ 
propriate  corrective  action  can  be  taken — usually  forcing  some  processes  to 
relinquish  resources  already  granted. 

The  simplest  instance  of  the  problem,  in  which  each  process  is  waiting 
for  a  single  resource  held  by  another  process,  is  solved  by  detecting  cycles  of 
the  form  “A  is  waiting  for  a  resource  held  by  B,  who  is  waiting  for  a  resource 
. . .  held  by  A.”  Straightforward  cycle-detection  algorithms  can  be  applied, 
but  they  may  not  be  efficient.  A  more  complicated  solution  is  required  if 
process  requests  have  a  more  interesting  structure,  such  as  “any  one  of  a 
set  of  resources”  or  “any  two  from  set  S  and  any  one  from  set  T”.  In  such 
cases,  the  problem  may  involve  detecting  a  knot  or  other  graph  structure, 
instead  of  a  cycle. 
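
For the simplest instance, where each process waits for at most one other
process, cycle detection amounts to following "waits for" pointers. The
sketch below is a centralized check on a wait-for relation that has
already been collected in one place; it is not a distributed algorithm and
it ignores the problem of gathering a consistent view. The name
find_deadlock is hypothetical.

```python
def find_deadlock(waits_for):
    """Look for a cycle in a single-resource wait-for relation.

    waits_for: dict process -> process holding the resource it wants;
               processes that are not waiting are absent or map to None
    Returns a list of processes forming a cycle, or None if there is none.
    """
    finished = set()                  # processes known not to lead to a cycle
    for start in waits_for:
        path, index = [], {}
        p = start
        while p is not None and p not in finished and p not in index:
            index[p] = len(path)
            path.append(p)
            p = waits_for.get(p)
        if p is not None and p in index:
            return path[index[p]:]    # the part of the path that loops
        finished.update(path)
    return None

deadlocked = find_deadlock({"A": "B", "B": "C", "C": "A", "D": "A"})
healthy = find_deadlock({"A": "B", "B": None, "C": "B"})
print(deadlocked, healthy)
```

Each process is visited at most once, so the check is linear in the number
of processes. Requests of the form "any one of a set of resources" need a
different graph structure, as noted above.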

A  difficulty  in  designing  distributed  deadlock  detection  algorithms  is 
avoiding  the  detection  of  “false  deadlocks”.  Consider  a  ring  of  processes 
each  of  which  occasionally  requests  a  resource  held  by  the  next,  but  in 
which there is no deadlock. An algorithm that simply checks the status of
all processes in some order could happen to observe every process when it
is  waiting  for  a  resource  and  incorrectly  decide  that  there  is  deadlock.  The 
algorithm  of  Chandy,  Misra,  and  Haas  [CHM83]  is  a  typical  algorithm  that 
does  not  detect  false  deadlocks. 

Global  Snapshots  The  global  state  of  a  distributed  system  consists  of  the 
state  of  each  process  and  the  messages  on  each  transmission  line.  A  global 
snapshot  algorithm  attempts  to  determine  the  global  state  of  a  system.  A 
trivial  algorithm  would  instantaneously  “freeze”  the  execution  of  the  sys¬ 
tem  and  determine  the  state  at  its  leisure,  but  such  an  algorithm  is  seldom 
feasible.  Moreover,  as  explained  in  Section  2.3,  determining  the  global  state 
requires  knowing  the  complete  temporal  ordering  of  events,  which  may  be 
impossible.  Therefore,  a  global  snapshot  algorithm  is  required  only  to  de¬ 
termine  a  global  state  that  is  consistent  with  the  known  temporal  ordering 
of  events.  This  is  sufficient  for  most  purposes.  This  problem  was  studied 
by  Chandy  and  Lamport  [CL85],  who  presented  a  simple  global  snapshot 
algorithm. 

A  global  snapshot  algorithm  can  be  used  whenever  one  wants  global  in¬ 
formation  about  a  distributed  system.  In  a  distributed  banking  system,  such 
an  algorithm  can  determine  the  total  amount  of  money  in  the  bank  without 
halting  other  banking  transactions.  Similarly,  one  can  use  a  global  snapshot 
algorithm  to  checkpoint  a  system  for  failure  recovery  without  halting  the 
system. 
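
The banking example can be simulated with a marker-based snapshot in the style of [CL85]. The sketch below is an illustration under stated assumptions, not a transcription of the published algorithm: channels are assumed reliable and FIFO, there is a single initiator, and the names Node, start_snapshot, deliver, and transfer are hypothetical.

```python
from collections import deque

MARKER = "MARKER"

class Node:
    def __init__(self, name, balance, peers):
        self.name, self.balance, self.peers = name, balance, peers
        self.recorded = None      # balance saved for the snapshot
        self.recording = {}       # incoming channel -> messages seen so far
        self.channel_state = {}   # incoming channel -> final recorded messages

def transfer(nodes, channels, src, dst, amount):
    nodes[src].balance -= amount
    channels[(src, dst)].append(amount)

def start_snapshot(node, channels):
    """Record own state, then send a marker on every outgoing channel."""
    node.recorded = node.balance
    node.recording = {p: [] for p in node.peers}
    for p in node.peers:
        channels[(node.name, p)].append(MARKER)

def deliver(nodes, channels, src, dst):
    """Deliver the oldest message on the FIFO channel src -> dst."""
    msg = channels[(src, dst)].popleft()
    node = nodes[dst]
    if msg == MARKER:
        if node.recorded is None:
            start_snapshot(node, channels)
        # The channel's state is whatever arrived between recording the
        # node's own state and receiving this marker.
        node.channel_state[src] = node.recording.pop(src)
    else:
        node.balance += msg
        if src in node.recording:
            node.recording[src].append(msg)

names = ["A", "B", "C"]
nodes = {n: Node(n, 100, [p for p in names if p != n]) for n in names}
channels = {(s, d): deque() for s in names for d in names if s != d}

transfer(nodes, channels, "B", "A", 20)   # 20 is in transit
start_snapshot(nodes["A"], channels)      # A initiates the snapshot
transfer(nodes, channels, "A", "C", 5)    # banking continues meanwhile
while any(channels.values()):
    for (s, d), queue in channels.items():
        if queue:
            deliver(nodes, channels, s, d)

total = sum(n.recorded for n in nodes.values()) + sum(
    sum(msgs) for n in nodes.values() for msgs in n.channel_state.values())
print(total)   # 300: recorded balances plus money recorded as in transit
```

The recorded balances alone sum to 280; the remaining 20 is found in the recorded state of the channel from B to A, which is why channel contents belong to the global state.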

A  general  class  of  applications  is  detecting  when  an  invariant  property 
holds.  Recall  that  an  invariant  is  a  property  of  the  state  which,  once  it  holds, 
will  continue  to  hold  in  all  subsequent  states.  (Invariants  of  distributed 
systems  are  often  called  “stable  properties”.)  Distributed  termination  and 
deadlock  are  invariants.  If  an  invariant  holds  in  the  consistent  state  observed 
by  a  global  snapshot  algorithm,  then  it  also  holds  in  all  global  states  reached 
by  the  system  after  the  snapshot  algorithm  terminates.  Thus,  one  can  de¬ 
tect  termination,  deadlock,  or  any  other  invariant  property  by  obtaining  a 
consistent  global  snapshot  and  checking  it  for  that  property. 

A  global  snapshot  algorithm  can  transform  an  algorithm  for  solving  a 
static  network  problem  to  one  that  solves  the  dynamic  version  of  the  same 
problem.  For  example,  the  static  version  of  the  deadlock  detection  problem, 
in  which  the  set  of  resources  held  and  requests  pending  never  changes,  is 
easier  to  solve  than  the  dynamic  version  because  there  is  no  possibility  of 
detecting  false  deadlocks.  The  harder  dynamic  version  can  be  solved  by 
taking a global snapshot, then running an algorithm for the static problem
on  the  state  determined  by  that  snapshot.  It  is  not  necessary  to  collect  the 
global  snapshot  information  in  one  place;  the  static  deadlock  detection  can 
be  done  with  a  distributed  algorithm.  This  strategy  is  used  in  a  deadlock 
detection  algorithm  by  Bracha  and  Toueg  [BT87]. 

Synchronizers  Many  simple  algorithms  have  been  designed  for  strongly 
synchronous  networks — ones  in  which  the  entire  computation  proceeds  in  a 
series  of  rounds.  A  network  synchronizer  is  a  program  designed  to  convert 
such  an  algorithm  to  one  that  can  run  in  a  completely  asynchronous  net¬ 
work.  Awerbuch  [Awe85]  has  designed  a  collection  of  network  synchronizers, 
varying in their message and time complexity. They have been used to
produce asynchronous algorithms that are more efficient than previously known
ones  for  breadth-first  search  and  the  determination  of  maximum  flows  and 
shortest  paths. 

The  simplest  of  the  synchronizers  transforms  an  algorithm  for  a  syn¬ 
chronous  network  into  an  asynchronous  algorithm  that  has  approximately 
the  same  execution  time.  This  seems  to  imply  that  any  problem  can  be 
solved  just  as  quickly  in  asynchronous  networks  as  in  synchronous  networks. 
However,  Arjomandi,  Fischer,  and  Lynch  [AFL83]  showed  that  there  are 
some  problems  whose  solution  requires  much  more  time  (greater  by  a  mul¬ 
tiplicative  factor  of  the  network  diameter)  in  an  asynchronous  than  in  a 
synchronous network. A typical problem is for all nodes to perform a
sequence of outputs, in such a way that every node does its i-th output before
any node does its (i+1)-st. A synchronous system can perform r such
output rounds in time r, but an asynchronous system requires extra time for
communication  between  all  the  nodes  in  between  each  pair  of  rounds. 

4.3.3  Changing  Networks 

The algorithms discussed so far in this subsection are designed to operate
in  communication  networks  that  are  fixed  while  the  algorithm  is  being  ex¬ 
ecuted.  Algorithms  for  the  same  problems  are  also  required  for  the  harder 
case  where  network  links  may  fail  and  recover  during  execution — that  is,  for 
changing  networks. 

One  can  translate  any  algorithm  for  a  fixed  but  arbitrary  network  into 
one  that  works  for  a  changing  network  as  follows.  The  nodes  run  the  fixed- 
network  algorithm  as  long  as  the  network  does  not  appear  to  change.  When¬ 
ever  a  node  detects  a  change,  it  stops  executing  the  old  instance  of  the  fixed- 
network algorithm and begins a new instance, this time on the changed
network. Thus, there can be many instances of the fixed-network algorithm
executing simultaneously. The different instances are distinguished by means
of “instance identifiers” attached to the messages.

It  is  not  difficult  to  implement  this  approach  using  an  unbounded  num¬ 
ber  of  instance  identifiers,  each  chosen  to  be  larger  than  the  previous  one 
used by the node. Afek, Awerbuch, and Gafni [AAG87] have developed a
method  that  requires  only  a  finite  number  of  identifiers.  However,  simply 
bounding  the  number  of  instance  identifiers  is  of  little  practical  significance, 
since  practical  bounds  on  an  unbounded  number  of  identifiers  are  easy  to 
find. For example, with 64-bit identifiers, a system that chooses ten per
second and was started at the beginning of the universe would not run out
of identifiers for several billion more years. However, through a transient
error, a node might choose too large an identifier, causing the system to run
out of identifiers billions of years too soon, perhaps within a few seconds. A
self-stabilizing algorithm using a finite number of identifiers would be quite
useful, but we know of no such algorithm.
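
The arithmetic behind this example is easy to check. The figures below are a rough sketch: the age of the universe is an assumed round value that does not come from the text, and the "transient error" is modeled, hypothetically, as a node jumping to an identifier 100 below the maximum.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
RATE = 10                        # identifiers chosen per second (from the text)
AGE_OF_UNIVERSE_YEARS = 13.8e9   # assumed rough figure, not from the text

# Time to use up all 2**64 identifiers when starting from zero.
years_to_exhaust = 2**64 / RATE / SECONDS_PER_YEAR
years_remaining = years_to_exhaust - AGE_OF_UNIVERSE_YEARS
print(f"{years_to_exhaust:.2e} years in total, {years_remaining:.2e} still left")

# A transient error makes one node choose an identifier close to the maximum;
# every later identifier must be larger, so very few remain.
corrupted_identifier = 2**64 - 100
seconds_left = (2**64 - corrupted_identifier) / RATE
print(f"{seconds_left:.0f} seconds left after the error")
```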

4.3.4 Link Protocols

Links,  joining  nodes  in  a  network,  are  implemented  with  one  or  more  physical 
channels, each delivering low-level messages called “packets”. Packet
delivery is not necessarily reliable. A link protocol is used to implement reliable
message communication using unreliable physical channels. Of course, it is
necessary to make some assumptions about the types of failures permitted
for the physical channels. For example, channels might be assumed to lose
and  reorder  messages,  but  not  to  duplicate  or  fabricate  them.  In  addition, 
some  liveness  assumption  on  the  physical  channel  is  needed  to  ensure  that 
messages  are  eventually  delivered;  a  common  assumption  is  that  if  infinitely 
many  packets  are  sent,  then  eventually  some  message  will  be  delivered. 

The  Alternating  Bit  Protocol  is  a  link  protocol  that  assumes  the  physical 
channel  may  lose  packets  but  cannot  reorder  them.  When  a  sender  wishes 
to  transmit  a  message,  it  assembles  a  packet  consisting  of  the  message  and 
a single bit “header”, and transmits this packet repeatedly on the physical
channel. Upon receipt of the packet, the receiver sends the header bit back
to the sender. When the sender receives a header bit that is the same as the
one  it  is  currently  transmitting,  it  knows  that  its  current  message  has  been 
received  and  switches  to  the  next  message,  using  the  opposite  header  bit. 
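
The protocol just described can be exercised in a small simulation. This is a sketch under simplifying assumptions: sender and receiver take turns in lockstep inside one loop, and packet loss is made deterministic by the hypothetical class LossyFifo, which keeps only every keep_every-th packet. None of these names come from the literature.

```python
from collections import deque

class LossyFifo:
    """FIFO channel that may lose packets but never reorders them."""
    def __init__(self, keep_every):
        self.queue, self.sent, self.keep_every = deque(), 0, keep_every

    def send(self, packet):
        self.sent += 1
        if self.sent % self.keep_every == 0:   # all other packets are lost
            self.queue.append(packet)

    def receive(self):
        return self.queue.popleft() if self.queue else None


def alternating_bit(messages, data_link, ack_link):
    delivered = []
    send_bit, expect_bit = 0, 0
    pending = deque(messages)
    while pending:
        # Sender: (re)transmit the current message with its header bit.
        data_link.send((send_bit, pending[0]))
        # Receiver: accept a packet only if it carries the expected bit,
        # and echo the header bit of every packet that arrives.
        packet = data_link.receive()
        if packet is not None:
            bit, msg = packet
            if bit == expect_bit:
                delivered.append(msg)
                expect_bit ^= 1
            ack_link.send(bit)
        # Sender: an echoed bit equal to the current one means "received".
        ack = ack_link.receive()
        if ack == send_bit:
            pending.popleft()
            send_bit ^= 1
    return delivered


print(alternating_bit(["a", "b", "c"], LossyFifo(3), LossyFifo(2)))
```

Retransmitted copies reach the receiver with a header bit it no longer expects, so they are acknowledged but not delivered a second time.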

This protocol does not work if the physical channels can reorder
messages. In fact, Lynch, Mansour, and Fekete [LMF88] showed that no
protocol with bounded headers can work over non-FIFO physical channels, if
the  best-case  number  of  packets  required  to  deliver  each  message  must  be 
bounded.  Attiya  et  al.  [AFWZ89]  complete  the  picture  by  showing  that  this 
latter  assumption  is  necessary — that  there  is  a  (not  very  practical)  protocol 
using  bounded  headers  if  the  best-case  number  of  packets  required  to  deliver 
one  message  is  permitted  to  grow  without  bound. 

Baratz  and  Siegel  [BS88]  developed  link  protocols  that  tolerate  “crashes” 
of  the  participating  nodes,  with  loss  of  information  in  the  nodes’  states. 
Their  algorithm  requires  the  node  at  each  end  of  the  link  to  have  one  bit  of 
“stable  memory”  that  survives  crashes.  It  is  shown  in  [LMF88]  that  this  bit 
of  stable  memory  is  necessary. 

Aho  et  al.  [AUWY82]  have  studied  the  basic  capabilities  of  finite-state 
link  protocols. 

4.4  Concurrency  Control  in  Databases 

A  database  consists  of  a  collection  of  items  that  are  individually  read  and 
written  by  the  operations  of  programs  called  transactions.  A  concurrency 
control  algorithm  executes  each  transaction  so  it  either  acts  like  an  atomic 
action,  with  no  intervening  steps  of  other  transactions,  or  aborts  and  does 
nothing.  This  condition,  called  serializability,  ensures  that  the  system  acts 
as  if  all  transactions  that  are  not  aborted  are  executed  in  some  serial  order. 
This order must be consistent with the order in which any externally visible
actions  of  the  transactions  occur.  The  serializability  condition  for  databases 
is  very  similar  to  the  atomic  condition  discussed  earlier  for  shared  variables. 

By  making  transactions  appear  atomic,  concurrency  control  makes  the 
system  easier  to  understand.  For  this  reason,  atomic  transactions  have  been 
proposed  as  a  basic  construct  in  distributed  programming  languages  and 
systems  such  as  Argus  [Lis85]  and  Camelot  [STP*87]. 

Allowing  transactions  to  be  aborted  permits  more  efficient  concurrency 
control  algorithms.  An  algorithm  can  make  scheduling  decisions  that  lead  to 
faster  execution  but  may  produce  a  nonserializable  execution;  it  aborts  any 
transaction  whose  execution  would  not  appear  atomic.  It  is  sometimes  useful 
to  abort  a  transaction  for  reasons  other  than  maintaining  serializability.  The 
transaction  might  be  running  very  slowly  and  holding  needed  resources,  or 
the  person  who  submitted  the  transaction  could  change  his  mind  and  want 
it  aborted. 

The  simplest  concurrency  control  algorithm  is  one  that  actually  runs 
the transactions serially, one at a time. However, such an algorithm is not
satisfactory because it eliminates the possibility of concurrent execution of
transactions, even if they access disjoint sets of data items.

4.4.1  Techniques 

Hundreds  of  papers  have  been  written  about  concurrency  control  algorithms, 
and  many  techniques  have  been  proposed.  We  discuss  only  the  two  most 
popular  ones:  locking  and  timestamps.  We  refer  the  reader  to  the  text¬ 
book  by  Bernstein,  Hadzilacos,  and  Goodman  [BHG87]  for  a  more  complete 
survey  of  concurrency  control  algorithms  and  an  exposition  of  some  of  the 
underlying  theory,  and  to  Lynch  et  al.  [LMWF88,  LMWF90]  for  a  general 
theory  of  concurrency  control  algorithms. 

Locking  The  concurrency  control  method  used  most  often  in  commercial 
systems  is  locking.  A  locking  algorithm  requires  a  transaction  to  obtain  a 
lock  on  each  data  item  before  accessing  it,  preventing  conflicting  operations 
on the item by different transactions. There are usually two kinds of locks:
exclusive  locks  that  enable  the  owner  to  read  or  write  the  item,  and  shared 
locks  that  enable  the  owner  only  to  read  it.  Several  transactions  can  hold 
shared  locks  on  the  same  item,  but  a  transaction  cannot  have  an  exclusive 
lock  while  any  other  transaction  holds  either  kind  of  lock  on  that  item. 

In a classic paper, Eswaran et al. [KPET76] showed that serializability
is guaranteed by two-phase locking, in which a transaction does not acquire
any new locks after releasing a lock (for example, if it requests all locks
at the beginning and releases them all at the end). However, if locks are
acquired  one  at  a  time,  deadlock  is  possible  in  which  each  member  of  some 
set  of  transactions  is  waiting  for  a  lock  held  by  another  member  of  the  set. 
Such  deadlock  must  be  detected,  and  the  deadlock  broken  by  aborting  one 
or  more  waiting  transactions.  The  effects  of  any  aborted  transaction  must 
be  undone;  this  may  require  saving  the  original  values  of  all  data  items  that 
have  been  modified  by  transactions  which  have  not  yet  completed. 
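
The lock compatibility rules and the two-phase rule can be sketched as follows. This is an illustrative fragment only: LockManager and TwoPhaseTransaction are hypothetical names, a refused request returns False instead of blocking, and deadlock detection and undoing of aborted transactions are left out.

```python
class LockManager:
    """Shared ("S") and exclusive ("X") locks; acquire() never blocks."""
    def __init__(self):
        self.holders = {}                  # item -> {transaction: "S" or "X"}

    def acquire(self, txn, item, mode):
        held = self.holders.setdefault(item, {})
        others = {t: m for t, m in held.items() if t != txn}
        if mode == "X" and others:
            return False                   # X conflicts with any other lock
        if mode == "S" and "X" in others.values():
            return False                   # S conflicts with another's X
        if held.get(txn) != "X":           # never downgrade an X lock
            held[txn] = mode
        return True

    def release(self, txn, item):
        self.holders.get(item, {}).pop(txn, None)


class TwoPhaseTransaction:
    """Refuses to acquire any lock once it has released one."""
    def __init__(self, name, manager):
        self.name, self.manager = name, manager
        self.shrinking = False

    def lock(self, item, mode):
        if self.shrinking:
            raise RuntimeError("two-phase rule violated: lock after release")
        return self.manager.acquire(self.name, item, mode)

    def unlock(self, item):
        self.shrinking = True
        self.manager.release(self.name, item)


lm = LockManager()
t1, t2 = TwoPhaseTransaction("T1", lm), TwoPhaseTransaction("T2", lm)
print(t1.lock("x", "S"), t2.lock("x", "S"))   # both shared locks are granted
print(t2.lock("x", "X"))                       # refused while T1 reads x
t1.unlock("x")
print(t2.lock("x", "X"))                       # granted after T1 lets go
```

After its first unlock, T1 is in its releasing phase, so a later call such as t1.lock("y", "S") raises an error instead of silently breaking serializability.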

The  notions  of  shared  and  exclusive  locks  can  be  generalized  to  other 
kinds  of  locks,  depending  on  the  semantics  of  the  operations  on  the  database — 
in  particular,  on  which  operations  commute.  These  other  classes  of  locks 
lead  to  more  general  and  efficient  concurrency  control  mechanisms  than 
ones  based  only  on  shared  and  exclusive  locks. 

Timestamps Instead of using locks, some algorithms use timestamps
(described in Section 2.3) to control access to data items. A timestamp is
assigned to each transaction, and the algorithm ensures that transactions not
aborted  are  executed  as  if  they  were  run  serially  in  the  order  specified  by 
their  timestamps.  This  serial  execution  order  is  obtained  if  operations  to 
the  same  item  are  performed  in  timestamp  order,  where  the  timestamp  of 
an operation is defined to be that of the transaction to which it belongs.
One  way  to  implement  this  condition  is  not  to  execute  an  operation  on  an 
item  until  all  operations  to  that  item  with  smaller  timestamps  have  been 
executed.  In  a  distributed  database,  waiting  until  no  operation  with  an  ear¬ 
lier  timestamp  can  arrive  may  be  expensive.  Alternatively,  one  can  abort  a 
transaction  if  it  tries  to  access  a  data  item  that  has  already  been  accessed 
by  a  transaction  with  a  later  timestamp. 
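
The abort-based alternative can be sketched as below. The fragment refines the rule slightly, as is common, by letting reads proceed unless a later-timestamped write has already occurred. TimestampStore and Aborted are hypothetical names, timestamps are assumed to be positive integers, and undoing the earlier writes of an aborted transaction is not shown.

```python
class Aborted(Exception):
    """Raised when an operation arrives too late in timestamp order."""


class TimestampStore:
    def __init__(self):
        self.value = {}
        self.read_ts = {}    # item -> largest timestamp that has read it
        self.write_ts = {}   # item -> largest timestamp that has written it

    def read(self, ts, item):
        # A later transaction has already overwritten the value we should see.
        if ts < self.write_ts.get(item, 0):
            raise Aborted(f"transaction {ts} reads {item} too late")
        self.read_ts[item] = max(ts, self.read_ts.get(item, 0))
        return self.value.get(item)

    def write(self, ts, item, value):
        # A later transaction has already read or written the item.
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            raise Aborted(f"transaction {ts} writes {item} too late")
        self.write_ts[item] = ts
        self.value[item] = value


store = TimestampStore()
store.write(2, "x", "b")
print(store.read(3, "x"))        # the value written by transaction 2
try:
    store.write(1, "x", "a")     # transaction 1 is older than both accesses
except Aborted as err:
    print("aborted:", err)
```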

So  far,  we  have  assumed  that  only  a  single  version  of  the  item  is  main¬ 
tained.  Additional  flexibility  can  be  achieved  by  keeping  several  earlier 
versions  as  well,  since  it  is  no  longer  necessary  to  abort  a  transaction  when 
it  accesses  an  item  that  has  already  been  accessed  by  transactions  with  later 
timestamps.  For  example,  if  the  transaction  is  just  reading  the  item,  the 
serial  order  can  be  preserved  by  reading  an  earlier  version.  Some  of  these 
earlier  versions  might  be  needed  anyway  to  restore  the  item’s  value  if  a 
transaction  is  aborted. 
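
Reading an earlier version can be sketched with a sorted list of versions. The class VersionedItem is a hypothetical illustration that shows only the read side; a complete multiversion algorithm must also reject a write that would invalidate a read already performed by a later transaction, which this fragment does not attempt.

```python
import bisect

class VersionedItem:
    def __init__(self, initial):
        self.stamps = [0]        # write timestamps, kept in increasing order
        self.values = [initial]  # values[i] was written at stamps[i]

    def write(self, ts, value):
        i = bisect.bisect_right(self.stamps, ts)
        self.stamps.insert(i, ts)
        self.values.insert(i, value)

    def read(self, ts):
        """Return the version written by the latest writer with stamp <= ts."""
        i = bisect.bisect_right(self.stamps, ts)
        return self.values[i - 1]


item = VersionedItem("v0")
item.write(5, "v5")
item.write(9, "v9")
print(item.read(7))   # a reader with timestamp 7 sees the version from 5
print(item.read(3))   # an older reader still sees the initial version
```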

While timestamps seem to offer some advantages over locking, almost
all  existing  database  systems  use  locking.  This  may  be  at  least  partly  due 
to  timestamp  algorithms  being  more  complicated  than  the  commonly  used 
locking  methods. 

4.4.2  Distribution  Issues 

In  distributed  systems,  items  can  be  located  at  multiple  sites.  A  concurrency 
control algorithm must guarantee that all sites affected by a transaction
agree  on  whether  or  not  it  is  aborted.  This  agreement  is  obtained  by  a 
commit  protocol  (Section  4.2.4). 

Copies  of  an  item  may  be  kept  at  several  sites,  to  increase  its  availability 
in  the  event  of  site  failure  or  to  make  reading  the  item  more  efficient.  For 
transactions  to  appear  atomic,  they  must  provide  the  appearance  of  access¬ 
ing  a  single  copy.  One  method  of  ensuring  this  is  to  require  each  operation 
to  be  performed  on  some  subset  of  the  copies — a  read  using  the  most  recent 
value from among the copies it reads. Atomicity is guaranteed if the
transactions are serialized and the sets of copies of any item accessed by any two
operations have at least one element in common, for example, if each read
reads two copies and each write writes all but one copy.
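
The example of reading two copies and writing all but one can be sketched with version numbers attached to the copies. ReplicatedItem is a hypothetical name, operations are assumed to be already serialized by the concurrency control algorithm, and the version numbers are an assumed way of identifying "the most recent value".

```python
class ReplicatedItem:
    def __init__(self, n_copies, read_quorum, write_quorum, initial=None):
        # Every read set must meet every write set, and any two write sets
        # must meet, so that a newer version is always visible.
        assert read_quorum + write_quorum > n_copies
        assert 2 * write_quorum > n_copies
        self.copies = [(0, initial)] * n_copies   # (version, value) per site
        self.r, self.w = read_quorum, write_quorum

    def read(self, sites):
        assert len(set(sites)) >= self.r
        version, value = max((self.copies[s] for s in sites),
                             key=lambda copy: copy[0])
        return value                               # most recent value seen

    def write(self, sites, value):
        assert len(set(sites)) >= self.w
        version = max(self.copies[s][0] for s in sites) + 1
        for s in sites:
            self.copies[s] = (version, value)


item = ReplicatedItem(n_copies=5, read_quorum=2, write_quorum=4)
item.write([0, 1, 2, 3], "a")
print(item.read([3, 4]))      # site 4 is stale, but site 3 has the new value
item.write([1, 2, 3, 4], "b")
print(item.read([0, 1]))      # site 0 is stale, but site 1 has the new value
```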

4.4.3  Nested  Transactions 

The  concept  of  a  transaction  has  been  generalized  to  nested  transactions, 
which  are  transactions  that  can  invoke  subtransactions  as  well  as  execute 
operations  on  items.  The  nesting  is  described  as  a  tree;  each  transaction  is 
the  parent  of  the  subtransactions  it  invokes,  and  an  added  root  transaction 
serves  as  the  parent  of  all  top-level  transactions.  Serializability  is  generalized 
to  the  requirement  that  for  every  node  in  the  transaction  tree,  all  its  children 
appear  to  run  serially.  Algorithms  based  upon  locking  and  timestamps  have 
been  devised  for  implementing  this  more  general  condition. 

With  nested  transactions,  failures  can  be  handled  by  aborting  a  sub¬ 
transaction  without  aborting  its  parent.  The  parent  is  informed  that  its 
child has aborted and can take corrective action. Nested transactions
appear as a fundamental concept in the Argus distributed programming
language [Lis85] and in the Camelot system [STP*87].

The  framework  presented  in  [LMWF88]  is  general  enough  for  modeling 
nested  transactions  as  well  as  ordinary  single-level  transactions. 